Even though vocoders based on this underlying speech model
have been quite successful in synthesizing intelligible speech,
they have not been successful in synthesizing high quality speech.
For clean speech, the synthesized speech often exhibits a "buzzy"
quality. For noisy speech, severe "buzziness" and other degradations
often occur, resulting in a large drop in intelligibility
scores. The poor quality of the synthesized speech is, in part,
due to the excitation models and the parameter estimation methods
used in existing vocoders.

This  Technical  Report  presents  the  Multi-Band  Excitation 
Vocoder  which  contains  a  speech  model  allowing  the  band  around 
each  harmonic  of  the  fundamental  frequency  to  be  declared  voiced 
or  unvoiced.  Accurate  and  robust  estimation  methods  are  developed 
for  the  parameters  of  this  new  speech  model  and  methods  for 
synthesizing speech from the model parameters are described.

Methods  for  coding  the  speech  model  parameters  are  presented  and 
an 8 kbps vocoder is developed.

This  8  kbps  Multi-Band  Excitation  (MBE)  Vocoder  is  compared 
with  a  more  conventional  Single  Band  Excitation  (SBE)  Vocoder 
(1  V/UV  bit  per  frame)  in  terms  of  quality  and  intelligibility. 
Informal listening indicates that the "buzzy" quality of the
SBE  Vocoder  is  eliminated  by  the  MBE  Vocoder  with  the  improvement 
being  most  dramatic  in  noisy  speech.  Intelligibility  tests 
(Diagnostic  Rhyme  Tests) for  speech  corrupted  by  additive  white 
noise  (approximately  5  dB  SNR)  produced  an  average  score  of  58.0 
points  for  the  MBE  Vocoder,  12  points  better  than  the  average 
score  of  46.0  for  the  SBE  Vocoder.  In  addition,  the  average 
score  for  the  MBE  Vocoder  was  only  about  5  points  below  the  average 
DRT  score  of  63.1  for  the  uncoded  noisy  speech.  This  represents 
a  much  smaller  intelligibility  decrease  in  noise  than  experienced 
by most vocoders.

Chapter 1


Introduction 

1.1  Problem  Description 

In a number of applications, introduction of speech models provides
improved performance. For example, in applications such as bandwidth
compression of speech, introduction of an appropriate speech model provides
increased intelligibility at low bit rates when compared to typical direct
coding of the waveform. The advantage of introducing a speech model is
that the highly redundant speech waveform is transformed to model
parameters with lower bandwidth. Examples of systems based on an underlying
speech model (vocoders) include linear prediction vocoders, homomorphic
vocoders, and channel vocoders. In these systems, speech is modeled on
a short-time basis as the response of a linear system excited by a periodic
impulse train for voiced sounds or random noise for unvoiced sounds. For
this class of vocoders, speech is analyzed by first segmenting speech using
a window such as a Hamming window. Then, for each segment of speech,
the excitation parameters and system parameters are determined. The
excitation parameters consist of the voiced/unvoiced decision and the pitch
period. The system parameters consist of the spectral envelope or the
impulse response of the system. This class of speech models is chosen since
the excitation and system parameters tend to vary slowly with time due to
physical constraints on the vocal tract and its excitation sources. In order
to synthesize speech, the excitation parameters are used to synthesize an
excitation signal consisting of a periodic impulse train in voiced regions or
random noise in unvoiced regions. This excitation signal is then filtered
using the estimated system parameters.
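
The synthesis step described above can be illustrated with a minimal
sketch. It assumes the linear system is an all-pole filter (as in a linear
prediction vocoder, one of the examples named above); the function name,
the argument names, and the use of Gaussian noise are hypothetical choices
made for illustration and are not taken from the report.

```python
import random


def synthesize_frame(voiced, pitch_period, lpc, gain, n_samples, seed=0):
    """Two-state excitation followed by an all-pole filter (illustrative only).

    voiced       -- single voiced/unvoiced decision for the whole frame
    pitch_period -- pitch period in samples (used only when voiced)
    lpc          -- assumed all-pole coefficients a_1..a_p of 1 / (1 + sum a_k z^-k)
    gain         -- scale applied to the excitation
    """
    rng = random.Random(seed)
    if voiced:
        # Periodic impulse train: one unit pulse every pitch_period samples.
        excitation = [1.0 if n % pitch_period == 0 else 0.0
                      for n in range(n_samples)]
    else:
        # Random noise excitation (Gaussian is an assumption).
        excitation = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

    out = []
    for n in range(n_samples):
        # y[n] = gain * e[n] - sum_k a_k * y[n - k]
        y = gain * excitation[n]
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                y -= a * out[n - k]
        out.append(y)
    return out
```

Because a single voiced/unvoiced flag selects the excitation for the entire
frame, the excitation spectrum in this sketch is either all harmonics or all
noise.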

In addition to the lower bandwidth of the model parameters, speech
models are often introduced to allow speech transformations through
modification of the model parameters. For example, in the application of
enhancement of speech spoken in a helium-oxygen mixture, a nonlinear
frequency warping of the spectral envelope is desired without modifying the
excitation parameters [28]. Introduction of a speech model allows
separation of spectral envelope and excitation parameters for separate processing
which could not be directly applied to the speech waveform.

Even  though  vocoders  based  on  this  class  of  underlying  speech  models 
have  been  quite  successful  in  synthesizing  intelligible  speech,  they  have 
not  been  successful  in  synthesizing  high  quality  speech.  The  poor  quality 
of  the  synthesized  speech  is,  in  part,  due  to  fundamental  limitations  in 
the  speech  models  and,  in  part,  due  to  inaccurate  estimation  of  the  speech 
model  parameters.  As  a  consequence,  vocoders  have  not  been  widely  used  in 
applications  such  as  time-scale  modification  of  speech,  speech  enhancement, 
or  high  quality  bandwidth  compression. 

One of the major degradations present in vocoders employing a simple
voiced/unvoiced model is a "buzzy" quality especially noticeable in
regions of speech which contain mixed voicing or in voiced regions of noisy
speech. Observations of the short-time spectra indicate that these speech
regions tend to have regions of the spectrum dominated by harmonics of
the fundamental frequency and other regions dominated by noise-like
energy. Since speech synthesized entirely with a periodic source exhibits a
"buzzy" quality and speech synthesized entirely with a noise source
exhibits a "hoarse" quality, it is postulated that the perceived "buzziness" of
vocoder speech is due to replacing noise-like energy in the original spectrum
with periodic "buzzy" energy in the synthetic spectrum. This occurs since
the simple voiced/unvoiced excitation model produces excitation spectra
consisting entirely of harmonics of the fundamental (voiced) or noise-like
energy (unvoiced). Since this problem is a major cause of quality
degradation in vocoders, any attempt to significantly improve vocoder quality
must account for these effects.

The degradation in quality of vocoded noisy speech is accompanied by a
decrease in intelligibility scores. For example, Gold and Tierney [7] report a
DRT score of 71.4 (Table 1.1) for the Belgard 2400 bps vocoder in F15 noise,
down 18.7 points from a score of 90.1 for the uncoded (5 kHz bandwidth, 12
bit PCM) noisy speech. In clean speech, a score of 86.5 was reported for the
Belgard vocoder, down only 10.3 points from a score of 96.8 for the uncoded
speech. They call the additional loss of 8.4 points in this noise condition the
"aggravation factor" for vocoders. One potential cause of this "aggravation
factor" is that vocoders which employ a single voiced/unvoiced decision for
the entire frequency band eliminate potentially important acoustic cues for [...]
Vocoder                      Clean Speech   F15 Noise

Uncoded                          96.8          90.1
Belgard: 2400 bps                86.5          71.4
Belgard: Noise Excitation        86.4          66.3

Table 1.1: DRT Scores


Another important piece of information in Table 1.1 is that for clean
speech, the DRT score remains about the same when an all-noise excitation
is used in the Belgard Vocoder. However, for noisy speech, the DRT score
drops about 5 points with the all-noise excitation. This indicates that the
composition of the excitation signal can be important for intelligibility,
especially in noisy speech.
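
As a worked check of the "aggravation factor" arithmetic quoted from Gold
and Tierney [7], the short sketch below recomputes the coding loss in each
condition from the DRT scores cited in the text (uncoded: 96.8 clean and
90.1 in F15 noise; Belgard 2400 bps: 86.5 clean and 71.4 in F15 noise). The
dictionary layout and the function name are illustrative assumptions.

```python
# DRT scores as quoted in the text from Gold and Tierney [7].
scores = {
    "uncoded": {"clean": 96.8, "noise": 90.1},
    "belgard_2400": {"clean": 86.5, "noise": 71.4},
}


def coding_loss(condition):
    """DRT points lost by vocoding in the given condition ("clean" or "noise")."""
    loss = scores["uncoded"][condition] - scores["belgard_2400"][condition]
    return round(loss, 1)


loss_clean = coding_loss("clean")   # 10.3 points
loss_noise = coding_loss("noise")   # 18.7 points

# The aggravation factor is the extra loss seen only in the noisy condition.
aggravation_factor = round(loss_noise - loss_clean, 1)   # 8.4 points
```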

As will be discussed in Section 1.2, in previous approaches to this problem
the voiced/unvoiced decisions or ratios control large contiguous regions
of the spectrum. These approaches are too restrictive to adequately model
[...] speech, the frequency of these degradations increases dramatically due to
the increased difficulty of the speech model parameter estimation problem.
Consequently, a high quality speech analysis/synthesis system must
have both an improved speech model and robust methods for accurately
estimating the speech model parameters.

1.2  Background 

A number of mixed excitation models have been proposed as potential
solutions to the problem of "buzziness" in vocoders. In these models, periodic
and noise-like excitations are mixed which have either time-invariant or
time-varying spectral shapes.

In excitation models having time-invariant spectral shapes, the excitation
signal consists of the sum of a periodic source and a noise source with
fixed spectral envelopes. The mixture ratio controls the amplitudes of the
periodic and noise sources. Examples of such models include Itakura and
Saito [14], and Kwon and Goldberg [15]. In the excitation model proposed
by Itakura and Saito, a white noise source is added to a white periodic
source. The mixture ratio between these sources is estimated from the
height of the peak of the autocorrelation of the LPC residual. Results
from this model were not encouraging [17]. In one excitation model
implemented by Kwon and Goldberg, a white periodic source and a white
noise source with the mixture ratio estimated from the autocorrelation of
the LPC residual are reported to produce "slightly muffled" and "hoarse"
synthesized speech.

The primary assumption in these excitation models is that the spectral
shapes of the periodic and noise sources are not time-varying. This
assumption is often violated in clean speech. For example, inspection of the
speech spectra in mixed voicing regions such as a typical /z/ (Figure 1.1)
indicates that low frequencies exhibit primarily periodic excitation and the
high frequencies exhibit primarily noise-like excitation. However,
inspection of speech spectra in almost completely voiced regions such as a typical
/a/ (Figure 1.2) indicates that a periodic source with a nearly flat spectral
envelope is required. Similarly, speech spectra in completely unvoiced
regions such as a typical /t/ (Figure 1.3) indicate that a noise-like source
with a flat spectral envelope is required. These observations indicate that
periodic and noise sources with time-varying spectral shapes are required
and help to explain the poor results obtained with the excitation models
having time-invariant spectral shapes.

[Figure 1.1: Spectrum of a /z/ Phoneme]

In excitation models having time-varying spectral shapes, the excitation
signal consists of the sum of a periodic source and a noise source with
time-varying spectral envelope shapes. Examples of such models include
Fujimura [5], Makhoul et al. [17], and Kwon and Goldberg [15].


In the excitation model proposed by Fujimura, the excitation spectrum
is divided into three fixed frequency bands. A separate cepstral analysis
is performed for each frequency band and a voiced/unvoiced decision for
each frequency band is made based on the height of the cepstrum peak as
a measure of periodicity.

In the excitation model proposed by Makhoul et al., the excitation
signal consists of the sum of a low-pass periodic source and a high-pass noise
source. The low-pass periodic source was generated by filtering a white
pulse source with a variable cut-off low-pass filter. Similarly, the high-pass
noise source was generated by filtering a white noise source with a variable
cut-off high-pass filter. The cut-off frequencies for the two filters are equal and
are estimated by choosing the highest frequency at which the spectrum is
periodic. Periodicity of the spectrum is determined by examining the
separation between consecutive peaks and determining whether the separations
are the same, within some tolerance level.
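
The peak-spacing test described above can be sketched as follows. This is
only an illustration of the idea, not the procedure of Makhoul et al.: the
function name, the choice of the first peak spacing as the reference, and
the relative tolerance are all assumptions, and the spectral peaks are
taken as already located.

```python
def periodic_cutoff(peak_freqs, tolerance):
    """Highest peak frequency up to which consecutive peak spacings agree.

    peak_freqs -- ascending frequencies of spectral peaks (any consistent unit)
    tolerance  -- allowed relative deviation of a spacing from the reference
    """
    if len(peak_freqs) < 2:
        # With fewer than two peaks there is no spacing to test.
        return 0.0

    # Assumption: the spacing of the first two peaks serves as the reference.
    reference = peak_freqs[1] - peak_freqs[0]
    cutoff = peak_freqs[1]
    for prev, cur in zip(peak_freqs[1:], peak_freqs[2:]):
        if abs((cur - prev) - reference) > tolerance * reference:
            break          # spacing no longer uniform: periodicity ends here
        cutoff = cur
    return cutoff
```

For peaks at 100, 200, 300, 400, 470, and 610 Hz with a 10 percent
tolerance, the spacing stays uniform through 400 Hz, so the sketch returns
400 as the common cut-off for the low-pass and high-pass filters.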

In a second excitation model implemented by Kwon and Goldberg, a
pulse source is passed through a variable gain low-pass filter and added to
itself, and a white noise source is passed through a variable gain high-pass
filter and added to itself. The excitation signal is the sum of the
resultant pulse and noise sources with the relative amplitudes controlled by a
voiced/unvoiced mixture ratio. The filter gains and voiced/unvoiced
mixture ratio are estimated from the LPC residual signal with the constraint
that the spectral envelope of the resultant excitation signal is flat.

In these excitation models, the voiced/unvoiced decisions or ratios
control large contiguous regions of the spectrum. The boundaries of these
regions are usually fixed and have been limited to relatively few (one to
three) regions. Observations by Fujimura [5] of "devoiced" regions of
frequency in vowel spectra in clean speech together with our observations
of spectra of voiced speech corrupted by random noise argue for a more
flexible excitation model than those previously developed. In addition, we
hypothesize that humans can discriminate between frequency regions
dominated by harmonics of the fundamental and those dominated by noise-like
energy and employ this information in the process of separating voiced
speech from random noise. Elimination of this acoustic cue in vocoders
based on simple excitation models may help to explain the significant
intelligibility decrease observed with these systems in noise [7]. To account
for the observed phenomena and restore potentially useful acoustic
information, a function giving the voiced/unvoiced mixture versus frequency is
desirable.


One  recent  approach  which  has  become  quite  popular  is  the  Multi-Pulse 
LPC model [1]. In this model, Linear Predictive Coding (LPC) is used to
model  the  spectral  envelope.  The  excitation  signal  consists  of  multiple 
pulses  per  pitch  period  instead  of  the  standard  LPC  excitation  consisting 
of  one  pulse  per  pitch  period  for  voiced  speech  or  a  white  noise  sequence 
for  unvoiced  speech.  With  this  model  the  original  signal  can  be  recovered 
by  using  one  pulse  per  sample  and  setting  the  excitation  signal  to  the  LPC 
residual  signal.  However,  coding  the  excitation  signal  for  this  case  would 
require  a  prohibitively  large  number  of  bits.  One  method  for  reducing  the 
number of bits required to code the excitation signal is to allow only a
small  number  of  pulses  per  pitch  period  and  then  code  the  amplitudes  and 
locations  of  these  pulses.  The  amplitudes  and  locations  of  the  pulses  are 
estimated  to  minimize  a  weighted  squared  difference  between  the  original 
Fourier  transform  and  the  synthetic  Fourier  transform.  This  estimation 
procedure  can  be  quite  expensive  computationally  since  the  error  criterion 
must  be  evaluated  for  all  possible  locations  of  each  pulse  introduced.  One 
drawback  of  this  approach  is  that  the  pulses  are  placed  to  minimize  the  fine 
structure differences between the frequency bands of the original Fourier
transform and the synthetic Fourier transform regardless of whether these
bands  contain  periodic  or  aperiodic  energy.  It  seems  important  to  obtain 
a  good  match  to  the  fine  structure  of  the  original  spectrum  in  frequency 
bands  containing  periodic  energy.  However,  in  frequency  bands  dominated 
by  noise-like  energy,  it  seems  important  only  to  match  the  spectral  envelope 
and  not  spend  bits  on  the  fine  structure.  Consequently,  it  appears  that  a 
more  efficient  coding  scheme  would  result  from  matching  only  the  periodic 
portions  of  the  spectrum  with  pulses  and  then  coding  the  rest  as  frequency 
dependent  noise  which  can  then  be  synthesized  at  the  receiver. 

1.3  Thesis  Outline 

In Chapter 2, our new Multi-Band Excitation Model for high quality
modeling of clean and noisy speech is described. This model allows a large
number of frequency bands to be declared voiced or unvoiced for improved
modeling of mixed voicing and noisy speech. In Chapter 3, methods for
estimating the parameters of this new model are developed. These methods
estimate the excitation and spectral envelope parameters simultaneously so
that the synthesized spectrum is closest in the least squares sense to the
original speech spectrum. This approach helps avoid the problem of the
spectral envelope interfering with pitch period estimation and the pitch
period interfering with the spectral envelope estimation. Chapter 4
discusses methods for synthesizing speech from these model parameters. In
Chapter 5, we apply the MBE Model to the problem of bit-rate reduction
for speech transmission and storage. Coding methods for the MBE Model
parameters are presented which result in a high quality 8 kbps vocoder.
High quality 8 kbps vocoders are of particular interest in applications such
as mobile telephones. The 8 kbps MBE Vocoder is then evaluated using the
results of informal listening as a measure of quality and Diagnostic Rhyme
Tests (DRTs) as a measure of intelligibility. Finally, Chapter 6 discusses
additional potential applications and presents some directions for future
research for additional quality improvement and bit-rate reduction.

The objective of this thesis was to develop a better speech model for
speech segments containing mixed voicing and for speech corrupted by
noise. These speech segments tend to be degraded by systems using
existing speech models. These degradations take the form of "buzziness" in the
synthesized speech and a severe decrease in DRT scores for noisy speech.
This objective was met through development of the Multi-Band
Excitation Model which allows the spectrum to be divided into many frequency
bands, each of which may be declared voiced or unvoiced. When applied
to the problem of bit-rate reduction, the MBE Model provided both
quality and intelligibility improvements over a more conventional Single Band
Excitation (SBE) Vocoder (1 V/UV bit per frame). In informal listening,
the MBE Vocoder did not have the "buzziness" present in the coded speech
synthesized by the SBE Vocoder. An 8 kbps speech coding system was
developed based on the MBE Model that provided a 12 point average DRT
score improvement over the SBE Vocoder for speech corrupted by additive
white noise. In addition, the average DRT score of the 8 kbps MBE Vocoder
was only about 5 points below the average DRT score of the uncoded noisy
speech.

Chapter 2


Multi-Band  Spectral  Excitation 
Speech  Model 


2.1  Introduction 

In  Chapter  1,  the  need  for  a  new  speech  model  capable  of  overcoming  the 
shortcomings  of  simple  speech  models  for  mixed  voicing  or  in  voiced  regions 
of  noisy  speech  was  discussed.  In  the  following  section,  our  new  Multi-Band 
Excitation  Model  is  described  for  high  quality  modeling  of  clean  and  noisy 
speech.

2.2  New Speech Model


Due to the quasi-stationary nature of a speech signal $s(n)$, a window $w(n)$
is usually applied to the speech signal to focus attention on a short time
interval of approximately 10 ms to 40 ms. The windowed speech segment
$s_w(n)$ is defined by

    $s_w(n) = w(n)\,s(n)$    (2.1)

The window $w(n)$ can be shifted in time to select any desired segment of
the speech signal $s(n)$. Over a short time interval, the Fourier transform
$S_w(\omega)$ of a windowed speech segment $s_w(n)$ can be modeled as the
product of a spectral envelope $H_w(\omega)$ and an excitation spectrum
$|E_w(\omega)|$.

    $\hat{S}_w(\omega) = H_w(\omega)\,|E_w(\omega)|$    (2.2)

As in many simple speech models, the spectral envelope $|H_w(\omega)|$ is a
smoothed version of the original speech spectrum $|S_w(\omega)|$. The spectral
envelope can be represented by linear prediction coefficients [19], cepstral
coefficients [25], formant frequencies and bandwidths [29], or samples of the
original speech spectrum [3]. The representational form of the spectral
envelope is not the dominant issue in our new model. However, the spectral
envelope must be represented accurately enough to prevent degradations in the
spectral envelope from dominating quality improvements achieved by the
addition of a frequency dependent voiced/unvoiced mixture function. An
example of a spectral envelope derived from the noisy speech spectrum of
Figure 2.1(a) is shown in Figure 2.1(b).
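
The windowing of Equation (2.1) and the short-time spectrum magnitude it
leads to can be sketched in a few lines. The sketch assumes a Hamming
window (mentioned in Chapter 1 as a typical choice), an 8 kHz sampling rate
so that 160 samples span 20 ms, and a direct DFT evaluation; the function
names and these numeric choices are illustrative assumptions rather than
values given in the report.

```python
import cmath
import math


def hamming(length):
    """Hamming window w(n), n = 0..length-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def windowed_segment(s, start, window):
    """s_w(n) = w(n) s(n) for the segment of s beginning at sample `start`."""
    return [window[n] * s[start + n] for n in range(len(window))]


def dft_magnitude(x, n_fft):
    """|S_w| sampled at bins k = 0..n_fft/2 (direct evaluation, zero-padded)."""
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                    for n in range(len(x))))
            for k in range(n_fft // 2 + 1)]


# Example: a 20 ms segment (160 samples at an assumed 8 kHz sampling rate)
# of a sinusoid whose period is 16 samples.
signal = [math.cos(2 * math.pi * n / 16) for n in range(400)]
segment = windowed_segment(signal, 100, hamming(160))
magnitude = dft_magnitude(segment, 256)
```

Shifting `start` selects a different segment of the signal, which is the
time shift of the window described in the text.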

The excitation spectrum in our new speech model differs from
previous simple models in one major respect. In previous simple models, the
excitation spectrum is totally specified by the fundamental frequency $\omega_0$
and a voiced/unvoiced decision for the entire spectrum. In our new model,
the excitation spectrum is specified by the fundamental frequency and a
frequency dependent voiced/unvoiced mixture function. In general, a
continuously varying frequency dependent voiced/unvoiced mixture function
would require a large number of parameters to represent it accurately. The
addition of a large number of parameters would severely decrease the
utility of this model in such applications as bit-rate reduction. To reduce this
problem, the frequency dependent voiced/unvoiced mixture function has
been restricted to a frequency dependent binary voiced/unvoiced decision.
To further reduce the number of these binary parameters, the spectrum
is divided into multiple frequency bands and a binary voiced/unvoiced
parameter is allocated to each band. This new model differs from previous
models in that the spectrum is divided into a large number of frequency
bands (typically twenty or more) whereas previous models used three
frequency bands at most [5]. Due to the division of the spectrum into multiple
frequency bands with a binary voiced/unvoiced parameter for each band,
we have termed this new model the Multi-Band Excitation Model.

[Figure 2.1: Multi-Band Excitation Model - Noisy Speech. Surviving panel
titles: (e) Noise Spectrum, (f) Excitation Spectrum, (g) Synthetic Spectrum]

The excitation spectrum $|E_w(\omega)|$ is obtained from the fundamental
frequency $\omega_0$ and the voiced/unvoiced parameters by combining segments of
a periodic spectrum $|P_w(\omega)|$ in the frequency bands declared voiced with
segments of a random noise spectrum in the frequency bands declared
unvoiced. The periodic spectrum $|P_w(\omega)|$ is completely determined by
$\omega_0$. One method for generating the periodic spectrum $|P_w(\omega)|$ is to
take the Fourier transform magnitude of a windowed impulse train with pitch
period $P$. In another method, the Fourier transform of the window is centered
around each harmonic of the fundamental frequency and summed to produce the
periodic spectrum. An example of $|P_w(\omega)|$ corresponding to
$\omega_0 = 0.045\pi$ is shown in Figure 2.1(c). The V/UV information allows us
to mix the periodic spectrum with a random noise spectrum in the frequency
domain in a frequency-dependent manner in representing the excitation spectrum.
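
The band-by-band combination described above can be sketched as follows.
The sketch assumes one band per harmonic, with band m extending half a
fundamental to either side of the m-th harmonic, and assumes the periodic
and noise magnitude spectra are already available on a uniform grid from 0
to pi. The function names, the band edges, and the handling of frequencies
outside the listed bands are illustrative assumptions.

```python
import math


def band_index(omega, w0, n_bands):
    """1-based index of the harmonic band containing frequency omega.

    Assumption: band m covers [(m - 0.5) * w0, (m + 0.5) * w0); frequencies
    below the first band or above the last are assigned to the nearest band.
    """
    m = int(math.floor(omega / w0 + 0.5))
    return max(1, min(m, n_bands))


def excitation_spectrum(periodic_mag, noise_mag, w0, vuv):
    """Take the periodic spectrum in voiced bands and the noise spectrum elsewhere.

    periodic_mag, noise_mag -- magnitude samples on a uniform grid over [0, pi]
    w0                      -- fundamental frequency in radians per sample
    vuv                     -- one boolean per band, True meaning voiced
    """
    n_bins = len(periodic_mag)
    out = []
    for k in range(n_bins):
        omega = math.pi * k / (n_bins - 1)
        m = band_index(omega, w0, len(vuv))
        out.append(periodic_mag[k] if vuv[m - 1] else noise_mag[k])
    return out
```

With `vuv` set to all True or all False, the sketch reduces to the
single-decision excitation of the simple models discussed in Chapter 1.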

The Multi-Band Excitation Model allows noisy regions of the excitation
spectrum to be synthesized with 1 V/UV bit per frequency band. This is
a distinct advantage over simple harmonic models in coding systems [21]
where noisy regions are synthesized from the coded phase requiring around
4 or 5 bits per harmonic. In addition, when the pitch period becomes small
with respect to the window length, noisy regions of the excitation spectrum
can no longer be well approximated with a simple harmonic model.

An example of V/UV information is displayed in Figure 2.1(d) with
a high value corresponding to a voiced decision. An example of a typical
random noise spectrum used is shown in Figure 2.1(e). The excitation
spectrum $|E_w(\omega)|$ derived from $|S_w(\omega)|$ in Figure 2.1(a) using the
above procedure is shown in Figure 2.1(f). The spectral envelope
$|H_w(\omega)|$ is represented by one sample $|A_m|$ for each harmonic of the
fundamental in both voiced and unvoiced regions to reduce the number of
parameters. When a densely sampled version of the spectral envelope is
required, it can be obtained by linearly interpolating between samples. The
synthetic speech spectrum $|\hat{S}_w(\omega)|$ obtained by multiplying
$|E_w(\omega)|$ in Figure 2.1(f) by $|H_w(\omega)|$ in Figure 2.1(b) is shown in
Figure 2.1(g).
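
The two operations in this paragraph, densifying the envelope by linear
interpolation between the harmonic samples and multiplying it by the
excitation spectrum, can be sketched as follows. The sketch assumes the
m-th sample $|A_m|$ sits at frequency m times the fundamental and holds the
first and last samples constant outside that range; the end-point handling
and the function names are assumptions made for illustration.

```python
def interpolate_envelope(harmonic_mags, w0, omegas):
    """Linearly interpolate envelope samples |A_m| onto the given frequencies.

    harmonic_mags -- |A_1|, ..., |A_M|, assumed located at w0, 2*w0, ..., M*w0
    w0            -- fundamental frequency (same unit as omegas)
    omegas        -- frequencies at which the dense envelope is wanted
    """
    n_harmonics = len(harmonic_mags)
    envelope = []
    for omega in omegas:
        x = omega / w0                      # position in units of harmonics
        if x <= 1:
            envelope.append(harmonic_mags[0])       # assumption: hold first sample
        elif x >= n_harmonics:
            envelope.append(harmonic_mags[-1])      # assumption: hold last sample
        else:
            lower = int(x)                  # harmonic just below omega
            frac = x - lower
            envelope.append((1 - frac) * harmonic_mags[lower - 1]
                            + frac * harmonic_mags[lower])
    return envelope


def synthetic_spectrum(envelope, excitation):
    """Synthetic magnitude as the product of envelope and excitation, bin by bin."""
    return [h * e for h, e in zip(envelope, excitation)]
```

Storing one envelope sample per harmonic keeps the parameter count tied to
the number of harmonics, while interpolation supplies the dense envelope
only when it is needed.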

Additional examples of voiced, unvoiced, and mixed voicing segments
of clean speech are shown in Figures 2.2 - 2.4. For voiced speech segments
(Figure 2.2), most of the spectrum is declared voiced. For unvoiced speech
segments (Figure 2.3), most of the spectrum is declared unvoiced. For
speech segments containing mixed voicing (Figure 2.4), regions
containing periodic energy (harmonics of the fundamental frequency) are marked
voiced and regions containing noise-like energy are marked unvoiced.

Based on the examples of Figures 2.1 - 2.4, it can be seen that some
regions of the speech spectrum are dominated by harmonics of the
fundamental frequency while others are dominated by noise-like energy depending
on noise and speech production conditions. To account for this observed
behavior, frequency bands with widths as small as the fundamental
frequency should be individually declared voiced or unvoiced. This was the
motivation for the Multi-Band Excitation Model.

It is possible [9] to synthesize high quality speech from the synthetic
speech spectrum $|\hat{S}_w(\omega)|$. To use the above model for the purpose of
developing a real time mid-rate speech coding system, however, it is desirable
to introduce one additional set of parameters in our model. Specifically, the
algorithm [8] that we have developed to synthesize speech from
$|\hat{S}_w(\omega)|$ is an iterative procedure that estimates the phase of
$\hat{S}_w(\omega)$ from $|\hat{S}_w(\omega)|$ and then synthesizes speech from
$|\hat{S}_w(\omega)|$ and the estimated phase of $\hat{S}_w(\omega)$. This
algorithm requires a delay of more than one second and a fairly accurate
representation of $|\hat{S}_w(\omega)|$. In applications such as time scale
modification of speech where these limitations are not serious and determining
the desired phase of $\hat{S}_w(\omega)$ is not easy, the algorithm that
synthesizes speech from $|\hat{S}_w(\omega)|$ has been successfully applied. In
applications such as real time speech coding, however, a delay of more than one
second may not be acceptable and furthermore, the desired phase of
$\hat{S}_w(\omega)$ can be determined straightforwardly. Due to the above
considerations, we introduce an additional set of model parameters, namely, the
phase of each harmonic declared voiced. We have chosen to include the phase in
the samples of the spectral envelope $A_m$ rather than the excitation spectrum
$|E_w(\omega)|$ for later notational convenience.

[Figure 2.2: Multi-Band Excitation Model - Voiced Speech. Surviving panel
titles: (c) Periodic Spectrum, (d) V/UV Information, (e) Noise Spectrum,
(f) Excitation Spectrum, (g) Synthetic Spectrum]

[Figure 2.3: Multi-Band Excitation Model - Unvoiced Speech. Surviving panel
titles: (e) Noise Spectrum, (f) Excitation Spectrum, (g) Synthetic Spectrum]

[Figure 2.4: Multi-Band Excitation Model - Mixed Voicing. Surviving panel
titles: (e) Noise Spectrum, (f) Excitation Spectrum, (g) Synthetic Spectrum]

The sets of parameters that we use in our model, then, are the
spectral envelope, the fundamental frequency, the V/UV information for each
harmonic, and the phase of each harmonic declared voiced. The phases of
harmonics in frequency bands declared unvoiced are not included since they
are not required by the synthesis algorithm. From these sets of parameters,
speech can be synthesized with little delay and significant computational
savings relative to synthesizing speech from $|\hat{S}_w(\omega)|$ alone. The
synthesis of speech from these model parameters is discussed in Chapter 4.

Chapter 3

Speech  Analysis 

3.1  Introduction 

In Chapter 2, the Multi-Band Excitation Speech Model was introduced.
The parameters of our model are the spectral envelope, the fundamental
frequency, V/UV information for each harmonic, and the phase of each
harmonic declared voiced. To obtain high quality reproduction of both
clean and noisy speech, accurate and robust methods for estimating these
parameters must be developed. In the next section, existing methods for
estimating the spectral envelope and fundamental frequency are discussed.
The inadequacies of these existing techniques led to the development of an
integrated method (Section 3.3) for estimating the model parameters so
that the difference between the synthetic spectrum and the original
spectrum is minimized. Obtaining an initial fundamental frequency using this
method can be quite expensive computationally. An alternative
formulation in Section 3.4 is used to substantially reduce the computation required
to obtain the initial fundamental frequency estimate to the order of an
autocorrelation pitch detection method.

In Section 3.5, we calculate the fundamental frequency bias associated
with minimizing the least-squares error criterion for a periodic signal in
noise. We then normalize the error criterion by the calculated bias to
produce an unbiased error criterion. This unbiased error criterion
significantly improves the system performance for noisy speech.

In Section 3.6, the required pitch period (or fundamental frequency)
accuracy is determined for accurate estimation of the voiced/unvoiced
information in the Multi-Band Excitation Model. An efficient procedure for
obtaining this accuracy based on the earlier sections of this chapter is
then described.

Finally, in Section 3.7, a flowchart of the complete analysis algorithm is
presented and discussed.


3.2  Background 


In previous approaches, the algorithms for estimation of excitation
parameters and estimation of spectral envelope parameters operate
independently. These parameters are usually estimated based on some
reasonable but heuristic criterion without explicit consideration of how
close the synthesized speech will be to the original speech. This can
result in a synthetic spectrum quite different from the original spectrum.

Previous approaches to spectral envelope estimation include Linear
Prediction [19] (All-Pole Modeling), windowing the cepstrum [25] (smoothing
the log magnitude spectrum), and windowing the autocorrelation function
[2] (smoothing the magnitude squared spectrum). In these approaches,
the pitch period often interferes with the spectral envelope estimation
procedure. For example, for speech frames with short pitch periods, widely
separated harmonics in the spectrum tend to cause pole locations and
bandwidths to be poorly estimated in the Linear Prediction method. Methods
that window the cepstrum or autocorrelation function obtain a poor
envelope estimate for short pitch periods due to interference of the peak
at the pitch period with the spectral envelope information present in the
low-time portions of these signals.

Previous approaches to pitch period estimation include the Gold-Rabiner
parallel processing method [6], choosing the minimum of the average
magnitude difference function (AMDF) [30], choosing the peak of the
autocorrelation of the Linear Prediction residual signal (SIFT) [18],
choosing the peak of the cepstrum [24], and choosing the peak of the
autocorrelation function [27]. In these approaches, the spectral envelope
often interferes with the pitch period estimation procedure. For example,
methods that choose the peak of the cepstrum or autocorrelation function
often obtain a poor pitch period estimate for short pitch periods due to
interference of the spectral envelope information present in the low-time
portions of these signals with the pitch period peak. Ross et al. [30]
remark in their description of the AMDF pitch detector that the limiting
factor on accuracy is the inability to completely separate the fine
structure from the effects of the spectral envelope.

In one technique for compensating for the spectral envelope before pitch
detection (SIFT), a spectral envelope estimate (produced by Linear
Prediction) is divided out of the spectrum (inverse filtering). In this
approach, the spectrum is “whitened” in an attempt to reduce the effects of
the spectral envelope on pitch period estimation. However, this technique
boosts low energy regions of the spectrum, which tend to be dominated by
noise-like energy, and this reduces the periodic signal to noise ratio.
Consequently, although performance is improved by reducing the effects of
the spectral envelope, performance is degraded by the reduction in the
periodic signal to noise ratio.

In  our  approach,  the  excitation  and  spectral  envelope  parameters  are 
estimated  simultaneously  so  that  the  synthesized  spectrum  is  closest  in  the 
least  squares  sense  to  the  spectrum  of  the  original  speech.  This  approach 
can  be  viewed  as  an  “analysis  by  synthesis”  method  [27]. 

3.3  Estimation  of  Speech  Model  Parameters 

Estimation of all of the speech model parameters simultaneously would
be a computationally prohibitive problem. Consequently, the estimation
process has been divided into two major steps. In the first step, the pitch
period and spectral envelope parameters are estimated to minimize the
error between the original spectrum $|S_w(\omega)|$ and the synthetic
spectrum $|\hat{S}_w(\omega)|$. Then, the V/UV decisions are made based on
the closeness of fit between the original and the synthetic spectrum at
each harmonic of the estimated fundamental.

The parameters of our speech model can be estimated by minimizing
the following error criterion:

$$\varepsilon = \frac{1}{2\pi}\int_{-\pi}^{\pi} G(\omega)\left[\,|S_w(\omega)| - |\hat{S}_w(\omega)|\,\right]^2 d\omega \tag{3.1}$$

where

$$|\hat{S}_w(\omega)| = |H_w(\omega)|\,|E_w(\omega)| \tag{3.2}$$

and $G(\omega)$ is a frequency dependent weighting function. This error
criterion was chosen since it performed well in our previous work [8]. In
addition, this error criterion yields fairly simple expressions for the
optimal estimates of the samples $|A_m|$ of the spectral envelope
$|H_w(\omega)|$. Other error criteria could also be used. For example, the
error criterion:

$$\varepsilon = \frac{1}{2\pi}\int_{-\pi}^{\pi} G(\omega)\,\left|S_w(\omega) - \hat{S}_w(\omega)\right|^2 d\omega \tag{3.3}$$

can be used to estimate both the magnitude and phase of the samples $A_m$
of the spectral envelope. These envelope samples are the magnitudes
(Equation (3.1)) or magnitudes and phases (Equation (3.3)) of the harmonics
for frequency bands declared voiced. These samples of the envelope are
sufficient for synthesizing speech in the voiced frequency bands using the
algorithm described in Chapter 4. For frequency bands declared unvoiced,
one sample of the spectral envelope per harmonic of the estimated
fundamental is also used. This sample is obtained by sampling a smoothed
version of the original spectrum $|S_w(\omega)|$. During synthesis,
additional samples of the spectral envelope in unvoiced regions are
required. These are obtained by linearly interpolating between the
estimated samples in the magnitude domain.

3.3.1  Estimation  of  Pitch  Period  and  Spectral  Envelope

The objective is to choose the pitch period and spectral envelope
parameters to minimize the error of Equation (3.1). In general, minimizing
this error over all parameters simultaneously is a difficult and
computationally expensive problem. However, we note that for a given pitch
period, the best spectral envelope parameters can be easily estimated. To
show this, we divide the spectrum into frequency bands centered around each
harmonic of the fundamental frequency. For simplicity, we will model the
spectral envelope as constant in this interval with a value of $A_m$. This
allows the error criterion of Equation (3.1) in the interval around the
$m$-th harmonic to be written as:

$$\varepsilon_m = \frac{1}{2\pi}\int_{a_m}^{b_m} G(\omega)\left[\,|S_w(\omega)| - |A_m|\,|E_w(\omega)|\,\right]^2 d\omega \tag{3.4}$$

where the interval $[a_m, b_m]$ is an interval with a width of the
fundamental frequency centered on the $m$-th harmonic of the fundamental.
The error $\varepsilon_m$ is minimized at:

$$|\hat{A}_m| = \frac{\int_{a_m}^{b_m} G(\omega)\,|S_w(\omega)|\,|E_w(\omega)|\,d\omega}{\int_{a_m}^{b_m} G(\omega)\,|E_w(\omega)|^2\,d\omega} \tag{3.5}$$

The corresponding estimate of $A_m$ based on the error criterion of
Equation (3.3) is:

$$\hat{A}_m = \frac{\int_{a_m}^{b_m} G(\omega)\,S_w(\omega)\,E_w^*(\omega)\,d\omega}{\int_{a_m}^{b_m} G(\omega)\,|E_w(\omega)|^2\,d\omega} \tag{3.6}$$

At this point, we could obtain estimates of the envelope parameters $A_m$
from Equation (3.5) or Equation (3.6) if we knew whether this frequency
band was voiced or unvoiced. If the frequency band contains primarily
periodic energy, there will be energy centered at the harmonic of the
fundamental with the characteristic window frequency response shape.
Consequently, if the periodic spectrum $|P_w(\omega)|$ is used as the
excitation spectrum $|E_w(\omega)|$ in this band, a good match will be
obtained. If the frequency band contains primarily aperiodic energy, there
will be no characteristic shape. Aperiodic energy in the frequency band is
perhaps best characterized by a lack of a good match when the periodic
spectrum $|P_w(\omega)|$ is used as the excitation spectrum. Thus, by using
$|P_w(\omega)|$ as the excitation spectrum at this point, the
voiced/unvoiced (periodic/aperiodic) decision can be made based on the
modeling error in this frequency band. After making the voiced/unvoiced
decision, the appropriate spectral envelope parameter estimate can be
selected. For a voiced frequency band, the following estimates are obtained
by substituting $|P_w(\omega)|$ for $|E_w(\omega)|$ in Equation (3.5) and
Equation (3.6):

$$|\hat{A}_m| = \frac{\int_{a_m}^{b_m} G(\omega)\,|S_w(\omega)|\,|P_w(\omega)|\,d\omega}{\int_{a_m}^{b_m} G(\omega)\,|P_w(\omega)|^2\,d\omega} \tag{3.7}$$

$$\hat{A}_m = \frac{\int_{a_m}^{b_m} G(\omega)\,S_w(\omega)\,P_w^*(\omega)\,d\omega}{\int_{a_m}^{b_m} G(\omega)\,|P_w(\omega)|^2\,d\omega} \tag{3.8}$$

An efficient method for obtaining a good approximation for the periodic
transform $P_w(\omega)$ in this interval is to precompute samples of the
Fourier transform of the window $w(n)$ and center it around the harmonic
frequency associated with this interval.

For an unvoiced frequency band, we model the excitation spectrum as
idealized white noise (unity across the band) which yields the following
estimate:

$$|\hat{A}_m| = \frac{\int_{a_m}^{b_m} G(\omega)\,|S_w(\omega)|\,d\omega}{\int_{a_m}^{b_m} G(\omega)\,d\omega} \tag{3.9}$$

This estimate reduces to the average of the original spectrum in the
frequency band when the weighting function $G(\omega)$ is constant across
the band. Since the unvoiced spectral envelope parameters are not used in
pitch period estimation, they only need to be computed after the final
pitch period estimate is determined.
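
As a rough illustration of the voiced estimate of Equation (3.7) and the unvoiced estimate of Equation (3.9), the sketch below evaluates both on sampled spectra, with the integrals replaced by sums over the samples in one band. The function names and the toy band values are assumptions made for this example and do not come from the report.

```python
import numpy as np

def voiced_amplitude(s_mag, p_mag, g):
    """Discrete form of Eq. (3.7): least-squares gain that fits |P_w| to |S_w|."""
    return np.sum(g * s_mag * p_mag) / np.sum(g * p_mag ** 2)

def unvoiced_amplitude(s_mag, g):
    """Discrete form of Eq. (3.9): weighted average of |S_w| over the band."""
    return np.sum(g * s_mag) / np.sum(g)

# Toy band (hypothetical values): the original spectrum is exactly three
# times the periodic spectrum, so the voiced estimate should come out as 3.
p_band = np.array([0.1, 0.6, 1.0, 0.6, 0.1])
s_band = 3.0 * p_band
g_band = np.ones_like(p_band)          # constant weighting across the band

a_voiced = voiced_amplitude(s_band, p_band, g_band)
a_unvoiced = unvoiced_amplitude(s_band, g_band)   # plain average when G is constant
```

With a constant weighting the unvoiced estimate is simply the mean of the band samples, which matches the remark following Equation (3.9).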

For adjacent intervals, the minimum error for entirely periodic
excitation $\varepsilon$ for the given pitch period is then computed as:

$$\varepsilon \approx \sum_m \hat{\varepsilon}_m \tag{3.10}$$

where $\hat{\varepsilon}_m$ is $\varepsilon_m$ in Equation (3.4) evaluated
with the $|\hat{A}_m|$ of Equation (3.7). In this manner, the spectral
envelope parameters which minimize the error $\varepsilon$ can be computed
for a given pitch period $P$. This reduces the original multi-dimensional
problem to the one-dimensional problem of finding the pitch period $P$ that
minimizes $\varepsilon$.
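
The one-dimensional search can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the report: a unit-energy Hamming window, a 4096-point FFT grid, a weighting $G(\omega)$ equal to one below a normalized frequency of 0.2 and zero above, bands starting at the first harmonic, and a synthetic five-harmonic test signal. The names `periodic_error` and `coarse_pitch` are hypothetical.

```python
import numpy as np

def periodic_error(s, period, nfft=4096, fmax=0.2):
    """Summed band errors for one candidate pitch period, with G = 1 below fmax."""
    n = len(s)
    w = np.hamming(n)
    w = w / np.sqrt(np.sum(w ** 2))                 # unit-energy window
    s_mag = np.abs(np.fft.rfft(s * w, nfft))        # |S_w| on the FFT grid
    w_mag = np.abs(np.fft.fft(w, nfft))             # |W|, indexed modulo nfft
    bins_per_harmonic = nfft / period
    err = 0.0
    m = 1
    while (m + 0.5) * bins_per_harmonic <= fmax * nfft:
        lo = int(np.ceil((m - 0.5) * bins_per_harmonic))
        hi = int(np.ceil((m + 0.5) * bins_per_harmonic))
        k = np.arange(lo, hi)                       # FFT bins in band m
        centre = int(round(m * bins_per_harmonic))
        p_mag = w_mag[(k - centre) % nfft]          # window transform centred on harmonic m
        a_m = np.dot(s_mag[k], p_mag) / np.dot(p_mag, p_mag)   # voiced amplitude estimate
        err += np.sum((s_mag[k] - a_m * p_mag) ** 2)           # band error at that estimate
        m += 1
    return err / nfft                               # sum over bins approximates (1/2pi) * integral

def coarse_pitch(s, periods):
    """Return the candidate period with the smallest error."""
    errors = [periodic_error(s, p) for p in periods]
    return periods[int(np.argmin(errors))]

# Synthetic test signal: five harmonics of a 40-sample pitch period.
n = np.arange(256)
s = sum(np.cos(2 * np.pi * h * n / 40.0) for h in range(1, 6))
best = coarse_pitch(s, list(range(30, 51)))
```

The search range here is deliberately narrow. Over a wider range, integer multiples of the true period give comparable errors, so a submultiple check is still needed.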

Experimentally, the error $\varepsilon$ tends to vary slowly with the pitch
period $P$. This allows an initial estimate of the pitch period near the
global minimum to be obtained by evaluating the error on a coarse grid. In
practice, the initial estimate is obtained by evaluating the error
$\varepsilon$ for integer pitch periods. In this initial coarse estimation
of the pitch period, the high frequency harmonics cannot be well matched so
the frequency weighting function $G(\omega)$ is chosen to de-emphasize high
frequencies.

If the pitch period of the original speech segment is 40 samples, the
associated normalized fundamental frequency is .025. We define normalized
frequency as the actual analog frequency divided by the sampling frequency
so that the normalized fundamental frequency is just the reciprocal of the
pitch period in samples. Integer multiples of the correct pitch period (80,
120, ...) will have fundamental frequencies at integer submultiples of the
correct fundamental frequency (.0125, .00833, ...). Every $n$-th (second,
third, ...) harmonic of the $n$-th submultiple (.0125, .00833, ...) of the
correct fundamental frequency will lie at the frequency of one of the
harmonics of the correct fundamental frequency. For example, Figure 3.1
shows the periodic spectrum $|P_w(\omega)|$ for pitch periods of 40 and 80
samples. Since every second harmonic of a fundamental frequency of .0125
lies at a harmonic of a fundamental frequency of .025, the error
$\varepsilon$ will be comparable for the correct pitch period and its
integer multiples. Consequently, once the pitch period which minimizes
$\varepsilon$ is found, the errors at submultiples of this pitch period are
compared to the minimum error and the smallest pitch period with
comparable error is chosen as the pitch period estimate. This feature can
be used to reduce computation by limiting the initial range of $P$ over
which the error $\varepsilon$ is computed to long pitch periods.
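
The submultiple check described above can be sketched as below. The report does not state how "comparable" is measured, so the ratio of 1.5, the minimum period of 20 samples, and the function name are assumptions chosen only for illustration.

```python
def smallest_comparable_period(error_fn, best_period, min_period=20.0, ratio=1.5):
    """Pick the smallest submultiple of best_period whose error is comparable.

    error_fn(period) returns the error for a candidate pitch period.
    "Comparable" is taken here (as an assumption) to mean within `ratio`
    times the minimum error.
    """
    e_min = error_fn(best_period)
    chosen = best_period
    n = 2
    while best_period / n >= min_period:
        candidate = best_period / n
        if error_fn(candidate) <= ratio * e_min:
            chosen = candidate          # later candidates are smaller, so keep overwriting
        n += 1
    return chosen

# Toy error values (hypothetical): the minimum is at 85 samples, and the
# submultiple at 42.5 samples has nearly the same error.
toy_errors = {85.0: 1.0, 42.5: 1.1}
estimate = smallest_comparable_period(lambda p: toy_errors.get(p, 5.0), 85.0)
```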

To accurately estimate the voiced/unvoiced decisions in high frequency
bands, pitch period estimates more accurate than the closest integer value
are required (see Section 3.6). More accurate pitch period estimates can
be obtained by using the best integer pitch period estimate chosen above as
an initial coarse pitch period estimate. Then, the error is minimized
locally to this estimate by using successively finer evaluation grids and a
frequency weighting function $G(\omega)$ which includes high frequencies.
The final pitch period estimate is chosen as the pitch period which
produces the minimum error in this local minimization. The pitch period
accuracies that can be obtained using this method are given in Section 3.6.

To obtain the maximum sensitivity to regions of the spectrum containing
pitch harmonics when large regions of the spectrum contain noise-like
energy, the expected value of the error $\varepsilon$ should not vary with
the pitch period for a spectrum consisting entirely of noise-like energy.
However, since the spectral envelope is sampled more densely for longer
pitch periods, the expected error is smaller for longer pitch periods. This
bias towards longer pitch periods is calculated in Section 3.5 and an
unbiased error criterion is developed by multiplying the error
$\varepsilon$ by a pitch period dependent correction factor. This
correction factor is applied to the error $\varepsilon$ in Equation (3.10)
prior to minimizing over the pitch period.

[Figure 3.1 panels: (a) Periodic Spectrum (Period=40); (b) Periodic Spectrum (Period=80); (c) Overlayed Periodic Spectra (Periods=40 and 80)]

Figure 3.1: Pitch Period Doubling

To illustrate our new approach, a specific example will be considered.
In Figure 3.2(a), 256 samples of female speech sampled at 10 kHz are
displayed. This speech segment was windowed with a 256 point Hamming
window and an FFT was used to compute samples of the spectrum
$|S_w(\omega)|$ shown in Figure 3.2(b). We use the property that the
Fourier transform of a real sequence is conjugate symmetric [26] in order
to compute these samples of the spectrum with a 256 point complex FFT. From
the FFT, 255 complex points (samples of the Fourier Transform between
normalized frequencies of 0 and .5) and 2 real points (at normalized
frequencies of 0 and .5) are obtained. After the magnitude operation, there
are 257 real samples of the spectrum between and including normalized
frequencies of 0 and .5. Figure 3.2(c) shows the error $\varepsilon$ as a
function of $P$ with $G(\omega) = 1$ for frequencies less than 2 kHz and
$G(\omega) = 0$ for frequencies greater than 2 kHz. The error $\varepsilon$
is smallest for $P = 85$, but since the error for the submultiple at
$P = 42.5$ is comparable, the initial estimate of the pitch period
is chosen as 42.5 samples. If an integer pitch period estimate is desired,
the error is evaluated at pitch periods of 42 and 43 samples and the integer
pitch period estimate is chosen as the pitch period with the smaller error.
If non-integer pitch periods are desired, the error $\varepsilon$ is
minimized around this initial estimate with $G(\omega)$ chosen to include
the high frequencies. A typical weighting function $G(\omega)$ which we
have used in practice is unity from 0 to 5 kHz. Figure 3.2(d) shows the
original spectrum overlayed with the synthetic spectrum for the final pitch
period estimate of 42.48 samples. For comparison, Figure 3.2(e) shows the
original spectrum overlayed with the synthetic spectrum for the best
integer pitch period estimate of 42 samples. This figure demonstrates the
mismatch of the high harmonics obtained if only integer pitch periods are
allowed.

[Figure 3.2 panel titles recovered: (b) Original Spectrum; (c) Error vs. Pitch Period; (e) Original and Synthetic (Integer P)]

Figure 3.2: Estimation of Model Parameters
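
The sample count quoted in the example (257 real magnitude samples from normalized frequency 0 to .5) is what a 512-point transform of a real sequence produces. The sketch below assumes the 256-sample frame is zero-padded to 512 points and uses NumPy's real-input FFT as a stand-in for the 256 point complex FFT mentioned in the text; the random frame is only a placeholder for speech data.

```python
import numpy as np

rng = np.random.default_rng(0)
frame = rng.standard_normal(256)            # placeholder for 256 speech samples
windowed = frame * np.hamming(256)          # 256 point Hamming window

spectrum = np.fft.rfft(windowed, n=512)     # zero-padded to 512 real points (assumption)
magnitude = np.abs(spectrum)                # samples of |S_w| from 0 to .5
freqs = np.fft.rfftfreq(512)                # matching normalized frequencies
```

The first and last entries of `spectrum` (normalized frequencies 0 and .5) are purely real, and the 255 entries between them are complex, which agrees with the count given in the text.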

Pitch track models can also be incorporated in this analysis system. For
example, if the pitch period is not expected to change very much from one
frame to the next, the error criterion can be biased to prefer pitch period
estimates around the estimate for the previous frame. A pitch track model
can also be used to reduce computation by constraining the possible pitch
periods to a smaller region. In regions of speech where the normalized
error obtained by the best pitch period estimate is small, the periodic
synthetic spectrum matches the original spectrum well and we can be
relatively certain that the pitch period estimate in these regions is
correct. The pitch track can then be extrapolated from such regions with
our analysis method with the pitch track model incorporated.

Many pitch tracking methods employ a smoothing approach to reduce
gross pitch errors. One problem with these techniques is that in the
smoothing process, the accuracy of the pitch period estimate is degraded
even for clean speech. One pitch tracking method which we have found
particularly useful in practice for obtaining accurate estimates in clean
speech and reducing gross pitch errors under very low periodic signal to
noise ratios is based on a dynamic programming approach. There are three
pitch track conditions to consider: 1) the pitch track starts in the
current frame, 2) the pitch track terminates in the current frame, and 3)
the pitch track continues through the current frame. We have found that the
third condition is adequately modeled by one of the first two. We wish to
find the best pitch track starting or terminating in the current frame. We
will look forward and backward $N$ frames where $N$ is small enough that
insignificant delay is encountered ($N = 3$ corresponding to 60 ms is
typical). The allowable frame-to-frame pitch period deviation is set to $D$
samples ($D = 2$ is typical). We then find the minimum error paths from $N$
frames in the past to the current frame and from $N$ frames in the future
to the current frame. We then determine which of these paths has the
smallest error and the initial pitch period estimate is chosen as the pitch
period in the current frame in which this smallest error path terminates.
The error along a path is determined by summing the errors at each pitch
period through which the path passes. Dynamic programming techniques [22]
are used to significantly reduce the computational requirements of this
procedure.
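
A minimal sketch of this look-back and look-ahead search is given below. It assumes the per-frame errors are already available as a matrix indexed by frame and by candidate pitch period (one column per integer period), and the function names and the toy error values are hypothetical.

```python
import numpy as np

def best_path_cost(errors, max_dev):
    """Minimum summed error of a path ending at each period index of the last row.

    errors has shape (n_frames, n_periods); successive frames on a path may
    differ by at most max_dev period indices.
    """
    cost = errors[0].astype(float).copy()
    n_periods = errors.shape[1]
    for frame in errors[1:]:
        new_cost = np.empty(n_periods)
        for j in range(n_periods):
            lo = max(0, j - max_dev)
            hi = min(n_periods, j + max_dev + 1)
            new_cost[j] = frame[j] + cost[lo:hi].min()
        cost = new_cost
    return cost

def initial_pitch_index(errors, current, n_look=3, max_dev=2):
    """Period index in the current frame where the cheapest path terminates."""
    past = errors[current - n_look: current + 1]            # path ends in current frame
    future = errors[current: current + n_look + 1][::-1]    # reversed so it also ends there
    cost_past = best_path_cost(past, max_dev)
    cost_future = best_path_cost(future, max_dev)
    if cost_past.min() <= cost_future.min():
        return int(np.argmin(cost_past))
    return int(np.argmin(cost_future))

# Toy example: a clean track at index 4 in the past frames, and a slightly
# lower isolated error at index 8 in the current frame (a gross error).
errors = np.ones((7, 10))
errors[0:3, 4] = 0.0
errors[3, 4] = 0.3
errors[3, 8] = 0.2
chosen = initial_pitch_index(errors, current=3)
```

In the toy data, the frame-by-frame minimum in the current frame is at index 8, but the path through the past frames selects index 4.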


3.3.2  Estimation  of  V/UV  Information


The voiced/unvoiced decision for each harmonic is made by comparing the
normalized error over each harmonic of the estimated fundamental to a
threshold. When the normalized error over the $m$-th harmonic

$$\xi_m = \frac{\hat{\varepsilon}_m}{\frac{1}{2\pi}\int_{a_m}^{b_m} G(\omega)\,|S_w(\omega)|^2\,d\omega} \tag{3.11}$$

is below the threshold, this region of the spectrum matches that of a
periodic spectrum well and the $m$-th harmonic is marked voiced. When
$\xi_m$ is above the threshold, this region of the spectrum is assumed to
contain noise-like energy. After the voiced/unvoiced decision is made for
each frequency band, the voiced or unvoiced spectral envelope parameter
estimates are selected as appropriate.

In practice, these computations are performed by replacing integrals of
continuous functions by summations of samples of these functions.
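
The per-band decision can be sketched with sums in place of integrals. In this illustration the normalized error is taken to be the band modeling error divided by the energy of the original spectrum in the band, with a constant weighting across the band. That normalization, the threshold of 0.2, and the function name are assumptions of the sketch rather than values from the report.

```python
import numpy as np

def band_is_voiced(s_mag, p_mag, threshold=0.2):
    """Return (voiced flag, normalized error) for one harmonic band."""
    a_m = np.dot(s_mag, p_mag) / np.dot(p_mag, p_mag)     # least-squares voiced amplitude
    band_error = np.sum((s_mag - a_m * p_mag) ** 2)        # modeling error in the band
    normalized = band_error / np.sum(s_mag ** 2)           # divide by band energy (assumption)
    return bool(normalized < threshold), float(normalized)

# Toy bands (hypothetical values).
p_band = np.array([0.1, 1.0, 0.1])                 # peaked periodic spectrum
voiced_flag, voiced_err = band_is_voiced(2.0 * p_band, p_band)   # scaled copy: good fit
noisy_flag, noisy_err = band_is_voiced(np.ones(3), p_band)       # flat band: poor fit
```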

3.4  Alternative  Formulation 

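By using a weighting function $G(\omega)$ which is one for all frequencies
or by filtering the original signal, the error criterion of Equation (3.3)
can be rewritten as:

$$\varepsilon = \frac{1}{2\pi}\int_{-\pi}^{\pi} \left|S_w(\omega) - \hat{S}_w(\omega)\right|^2 d\omega \tag{3.12}$$

In Section 3.3, the synthetic transform $\hat{S}_w(\omega)$ is the product
of a spectral envelope and a periodic spectrum. Equivalently, the synthetic
transform can be written as the transform of a periodic signal:

$$\hat{S}_w(\omega) = \sum_{m=-M}^{M} A_m W(\omega - m\omega_0) \tag{3.13}$$

where $M$ is the largest integer such that $M\omega_0$ is in the frequency
band $[-\pi, \pi]$ and $W(\omega)$ is the Fourier transform of the window
function:

$$W(\omega) = \sum_{n=-\infty}^{\infty} w(n)\,e^{-j\omega n} \tag{3.14}$$

Equation (3.13) can be written in vector notation as

$$\hat{S}_w(\omega) = \mathbf{w}^T \mathbf{a} \tag{3.15}$$

where

$$\mathbf{a} = \begin{bmatrix} A_{-M} \\ A_{-M+1} \\ \vdots \\ A_{M} \end{bmatrix} \tag{3.16}$$

$$\mathbf{w} = \begin{bmatrix} W(\omega + M\omega_0) \\ W(\omega + (M-1)\omega_0) \\ \vdots \\ W(\omega - M\omega_0) \end{bmatrix} \tag{3.17}$$

In this notation, the error criterion of Equation (3.12) can be expressed
as:

$$\varepsilon = \frac{1}{2\pi}\int_{-\pi}^{\pi} |S_w(\omega)|^2\,d\omega - \mathbf{b}^H\mathbf{a} - \mathbf{a}^H\mathbf{b} + \mathbf{a}^H\mathbf{R}\,\mathbf{a} \tag{3.18}$$

where

$$\mathbf{R} = \frac{1}{2\pi}\int_{-\pi}^{\pi} \mathbf{w}^*\mathbf{w}^T\,d\omega \tag{3.19}$$

and

$$\mathbf{b} = \frac{1}{2\pi}\int_{-\pi}^{\pi} \mathbf{w}^* S_w(\omega)\,d\omega \tag{3.20}$$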

With this formulation, for a given fundamental frequency $\omega_0$,
minimizing the error criterion of Equation (3.12) results in the harmonic
amplitude estimates $\hat{A}_m$ being the solution to the following linear
equation:

$$\mathbf{R}\,\mathbf{a} = \mathbf{b} \tag{3.21}$$

Using these amplitude estimates reduces the error of Equation (3.18) to:

$$\varepsilon = \frac{1}{2\pi}\int_{-\pi}^{\pi} |S_w(\omega)|^2\,d\omega - \mathbf{a}^H\mathbf{R}\,\mathbf{a} \tag{3.22}$$

which is equivalent to:

$$\varepsilon = \frac{1}{2\pi}\int_{-\pi}^{\pi} |S_w(\omega)|^2\,d\omega - \frac{1}{2\pi}\int_{-\pi}^{\pi} |\hat{S}_w(\omega)|^2\,d\omega \tag{3.23}$$

It should be noted that the synthetic transform $\hat{S}_w(\omega)$ of
Equation (3.23) has been optimized over the harmonic amplitudes $A_m$ and
is therefore constrained to be evaluated at the optimal harmonic amplitudes
for any particular fundamental frequency. We wish to minimize this error
over all possible fundamental frequencies. This is equivalent to maximizing
the second term over the fundamental frequency, since the first term is
independent of fundamental frequency. This second term can be expressed
independent of the harmonic amplitude estimates by applying Equation
(3.21):

$$\Psi = \frac{1}{2\pi}\int_{-\pi}^{\pi} |\hat{S}_w(\omega)|^2\,d\omega = \mathbf{a}^H\mathbf{R}\,\mathbf{a} = \mathbf{b}^H\mathbf{R}^{-1}\mathbf{b} \tag{3.24}$$

The window frequency responses are orthonormal if

$$\frac{1}{2\pi}\int_{-\pi}^{\pi} \mathbf{w}^*\mathbf{w}^T\,d\omega = \mathbf{R} = \mathbf{I} \tag{3.25}$$

where $\mathbf{I}$ is the identity matrix. In order for orthonormality to
hold, the window must be normalized so that

$$\sum_{n=-\infty}^{\infty} w^2(n) = 1 \tag{3.26}$$

The window frequency responses are approximately orthonormal when the
sidelobes of the window are small and the fundamental frequency is larger
than the width of the main lobe so that the main lobes of window frequency
responses at adjacent harmonics don't interact. For approximately
orthonormal window frequency responses, we have
$\mathbf{R}^{-1} \approx \mathbf{I}$ which yields:

$$\Psi \approx \mathbf{b}^H\mathbf{b} \tag{3.27}$$

This approximation allows $\Psi$ to be expressed in the time domain as

$$\Psi \approx \sum_{n=-\infty}^{\infty}\sum_{k=-\infty}^{\infty} w^2(n)s(n)\,w^2(k)s(k) \sum_{m=-M}^{M} e^{-j\omega_0 m(n-k)} \tag{3.28}$$

For $\omega_0 M = \pi$, this simplifies to

$$\Psi \approx P\sum_{k=-\infty}^{\infty}\sum_{n=-\infty}^{\infty} w^2(n)s(n)\,w^2(n-kP)s(n-kP) = P\sum_{k=-\infty}^{\infty}\phi(kP) \tag{3.29}$$

where $\phi(m)$ is the autocorrelation function of $w^2(n)s(n)$:

$$\phi(m) = \sum_{n=-\infty}^{\infty} w^2(n)s(n)\,w^2(n-m)s(n-m) \tag{3.30}$$

Thus, maximizing $\Psi$ is approximately equivalent to maximizing a
function of the autocorrelation function of the signal multiplied by the
square of the analysis window. This technique is similar to the
autocorrelation method but considers the peaks at multiples of the pitch
period instead of only the peak at the pitch period. This suggests a
computationally efficient method for maximizing $\Psi$ over all integer
pitch periods by computing the autocorrelation function using the Fast
Fourier Transform (FFT) and then summing samples spaced by the pitch
period. It should be noted that in practice, the summations of Equation
(3.29) are finite due to the finite length of the window $w(n)$. Although
this is a pseudo maximum likelihood pitch estimation method as in Wise et
al. [33], it differs in that it is a frequency domain formulation rather
than a time domain formulation. One advantage of this formulation is that a
non-rectangular analysis window is allowed. For a rectangular window, the
result given by Equations (3.29) and (3.30) reduces to the result given in
Wise et al. [33].
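
The FFT-based procedure can be sketched as follows: compute the autocorrelation of the signal multiplied by the squared window, then sum its samples at multiples of each candidate period. The unit-energy Hamming window, the search range, the synthetic five-harmonic test signal, and the function names are assumptions of this sketch.

```python
import numpy as np

def windowed_autocorrelation(s):
    """Autocorrelation of w^2(n) s(n) at lags 0 .. len(s)-1, computed by FFT."""
    n = len(s)
    w = np.hamming(n)
    w = w / np.sqrt(np.sum(w ** 2))             # window normalized to unit energy
    x = w ** 2 * s
    spec = np.fft.rfft(x, 2 * n)                # zero-pad to avoid circular wrap-around
    return np.fft.irfft(np.abs(spec) ** 2, 2 * n)[:n]

def psi_integer(phi, period):
    """P times the sum of phi over all multiples of an integer period."""
    lags = np.arange(period, len(phi), period)
    # phi is symmetric in the lag, so negative multiples double the positive ones.
    return period * (phi[0] + 2.0 * np.sum(phi[lags]))

# Synthetic test signal: five harmonics of a 40-sample pitch period.
n = np.arange(256)
s = sum(np.cos(2 * np.pi * h * n / 40.0) for h in range(1, 6))
phi = windowed_autocorrelation(s)
best = max(range(20, 61), key=lambda p: psi_integer(phi, p))
```

The autocorrelation is computed once per frame, so each additional candidate period costs only a short sum.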

More accurate pitch period estimates can be efficiently obtained by
maximizing

$$P\sum_{k=-\infty}^{\infty}\phi(\lfloor kP \rfloor) \tag{3.31}$$

over non-integer pitch periods where $\lfloor x \rfloor$ is defined as the
largest integer not greater than $x$. Higher accuracy is obtained in this
method due to the contributions of the peaks at multiples of the pitch
period in the autocorrelation function.
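
A sketch of evaluating the sum at non-integer pitch periods is shown below. It assumes a unit-energy Hamming window and a synthetic signal with a 42.5-sample period, and it folds negative lags onto positive ones using the symmetry of the autocorrelation, which is a simplification made for this example. The function names are hypothetical.

```python
import numpy as np

def windowed_autocorrelation(s):
    """Autocorrelation of w^2(n) s(n) at lags 0 .. len(s)-1, computed by FFT."""
    n = len(s)
    w = np.hamming(n)
    w = w / np.sqrt(np.sum(w ** 2))
    x = w ** 2 * s
    spec = np.fft.rfft(x, 2 * n)
    return np.fft.irfft(np.abs(spec) ** 2, 2 * n)[:n]

def psi_fractional(phi, period):
    """P times the sum of phi at lags floor(k * P), for a possibly non-integer P."""
    k = np.arange(1, int((len(phi) - 1) / period) + 1)
    lags = np.floor(k * period).astype(int)      # largest integer not greater than kP
    # Negative lags are folded onto positive ones (assumption of this sketch).
    return period * (phi[0] + 2.0 * np.sum(phi[lags]))

# Synthetic test signal: five harmonics of a 42.5-sample pitch period.
n = np.arange(256)
s = sum(np.cos(2 * np.pi * h * n / 42.5) for h in range(1, 6))
phi = windowed_autocorrelation(s)
psi_true = psi_fractional(phi, 42.5)
psi_low = psi_fractional(phi, 40.0)
psi_high = psi_fractional(phi, 45.0)
```

For an integer period the function reduces to summing the autocorrelation at exact multiples of the period.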

Figure 3.3 shows a comparison of error versus pitch period for two
different computation methods for a segment of speech with a pitch period
of approximately 85 samples (the pitch period was determined by hand).
The first method computes the error using the frequency domain approach
given by Equation (3.10). The second method computes the error using
the autocorrelation approach described by the following equation:

$$\varepsilon \approx \sum_{n=-\infty}^{\infty} w^2(n)s^2(n) - P\sum_{k=-\infty}^{\infty}\phi(kP) \tag{3.32}$$

As can be seen from the figure, these two methods achieve approximately
the same error curves. After estimating the pitch period using the
autocorrelation domain approach, the spectral envelope parameters and the
voiced/unvoiced parameters can be estimated as described in Section 3.3.1
and Section 3.3.2 for this specific pitch period.

Figure 3.3: Comparison of Error Computation Methods

3.5  Bias  Correction

As discussed in Section 3.3, the expected value of the error of Equation
(3.1) or Equation (3.3) is smaller for longer pitch periods since more free
parameters are available for matching the original spectrum. This effect
can be seen in Figure 3.3 as a general decrease in the error for larger
pitch periods. To demonstrate this bias, we will calculate the expected
value of the error $\varepsilon$ of Equation (3.12) for a periodic signal
$p(n)$ in white noise $d(n)$:

$$s(n) = p(n) + d(n) \tag{3.33}$$

where

$$E[d(n)] = 0 \tag{3.34}$$

and

$$E[d(n)d(m)] = \sigma^2\,\delta(n - m) \tag{3.35}$$

The only constraints on the periodic signal $p(n)$ are that it has pitch
period $P$ so that

$$p(n + kP) = p(n) \tag{3.36}$$

where $k$ is an integer.

Using  Equation  (3.23),  the  expected  value  of  the  error  £  of  Equation 
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(3.12)  evaluated  at  the  optimal  amplitude  estimates  for  a  given  pitch  period 
P  is  then: 


E If)  =  E  [i  JT  |Sw(cj)|’  <fc,]  -  E  [i  /'  *,]  (3.37) 

The  first  term  in  Equation  (3.37)  can  be  expressed  in  the  time  domain  as: 

E  [i £  |S.(w)l’  du]  =  E  [£_ 4(»)j  (3.38) 

For  a  window  w(n)  normalized  according  to  Equation  (3.26)  this  reduces 


to: 


E  f  53  «l(n)  =**+  53  w2(n)P2(n)  (3.39) 

Ln=-oo  n=-oo 

The  second  term  in  Equation.  (3.37)  is  the  expected  value  of  ¥  of  Equation 
(3.24)  which  can  be  written  as: 


O0  oo 


E  [tf]  «  P  53  wJ(n)u;2(n  —  kP)E  [s(n)s(n  —  kP)\  (3.40) 

&s—oo  n=— oo 

For  s(n)  consisting  of  the  sum  of  a  periodic  signal  p(n)  of  period  P  and 
white  noise.  Equation  (3.40)  reduces  to: 

E[^/\taa2P  53  w4(n)  +  P  53  w2(n)P2(n)  YL  w2(n  —  kP)  (3.41) 

— OO  «=— OO  oo 

For  slowly  changing  window  functions,  the  following  approximation  can  be 
made: 

OO  00 

p  53  tyJ(n  ”  kP) »  53  u;J(n) =  i  (3.42) 

k—-oo  ns -oo 
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This  approximation  reduces  Equation  (3.41)  to: 

“>4(n)  +  5Z  wS(»)pJ(n)  (3.43) 

n=—oo  n=— oo 

By combining Equation (3.39) and Equation (3.43) a good approximation
to the expected value of the error $\varepsilon$ of Equation (3.12) is obtained:

$$E[\varepsilon] \approx \sigma^2 \left(1 - P \sum_{n=-\infty}^{\infty} w^4(n)\right) \tag{3.44}$$

To determine the accuracy of the bias approximation given by Equation
(3.44), error versus pitch period curves were computed for 100 different
white noise segments and averaged together. This average error curve
is shown in Figure 3.4 together with the bias approximation of Equation
(3.44). As can be seen from the figure, the bias approximation is very close
to the average error curve.
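The averaging experiment can be reproduced in outline. The following Python
sketch is a minimal illustration under stated assumptions, not the program
used to produce Figure 3.4: it assumes that $\Psi$ is evaluated in the time
domain with the summand of Equation (3.40), that the window is a 256 point
square root triangular window scaled so that the sum of $w^2(n)$ is one, and
that the noise has unit variance. All function names are illustrative.

```python
import numpy as np

def sqrt_triangular_window(length=256):
    """Square-root triangular window scaled so that sum(w**2) == 1."""
    w = np.sqrt(np.bartlett(length))
    return w / np.sqrt(np.sum(w ** 2))

def psi(s, w, period):
    """P * sum_k sum_n x(n) x(n - kP) with x(n) = w^2(n) s(n), as in Eq. (3.40)."""
    x = (w ** 2) * s
    n = len(x)
    total = np.dot(x, x)                       # k = 0 term
    for lag in range(period, n, period):       # k = +/-1, +/-2, ...
        total += 2.0 * np.dot(x[lag:], x[:n - lag])
    return period * total

def pitch_error(s, w, period):
    """Windowed signal energy minus Psi (assumed time-domain form of the error)."""
    return np.sum((w * s) ** 2) - psi(s, w, period)

def predicted_bias(w, period, sigma2=1.0):
    """Bias approximation of Eq. (3.44)."""
    return sigma2 * (1.0 - period * np.sum(w ** 4))

def average_noise_error(period, trials=500, seed=0):
    """Average error over independent unit-variance white noise segments."""
    rng = np.random.default_rng(seed)
    w = sqrt_triangular_window()
    errors = [pitch_error(rng.standard_normal(len(w)), w, period)
              for _ in range(trials)]
    return float(np.mean(errors))
```

Calling `average_noise_error(P)` and `predicted_bias(sqrt_triangular_window(), P)`
for several pitch periods P gives two curves that can be compared in the same
way as the two curves of Figure 3.4.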

Figure 3.4: Average Error Versus Pitch Period

An unbiased error criterion is desired to prevent longer pitch periods
from being consistently chosen over shorter pitch periods for noisy periodic
signals. In addition, a normalized error criterion that is near zero for a
purely periodic signal and is near one for a noise signal is desirable. The
following error criterion is unbiased with respect to pitch period and is
normalized appropriately:

$$\varepsilon_{UB} = \frac{\frac{1}{2\pi}\int_{-\pi}^{\pi} |S_w(\omega) - \hat{S}_w(\omega)|^2\,d\omega}{\left(1 - P \sum_{n=-\infty}^{\infty} w^4(n)\right) \frac{1}{2\pi}\int_{-\pi}^{\pi} |S_w(\omega)|^2\,d\omega} \tag{3.45}$$


It is important to note that the error criterion of Equation (3.45) is
independent of the noise variance $\sigma^2$ so that estimation of the noise
variance is not required. In addition, similar results can be seen to apply
for colored noise by first applying a whitening filter to the original
transform $S_w(\omega)$ and then removing it from the final result.
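The behavior of the normalized criterion can be illustrated with a short
Python sketch. It is only a sketch under assumptions: the numerator is
evaluated in the time domain as the windowed energy minus $\Psi$ (using the
summand of Equation (3.40)), only integer pitch periods are handled, and the
function names and test signals are illustrative.

```python
import numpy as np

def unbiased_error(s, w, period):
    """Normalized, unbiased error in the spirit of Eq. (3.45), integer period."""
    x = (w ** 2) * s
    n = len(x)
    energy = np.sum((w * s) ** 2)              # time-domain form of the |S_w|^2 term
    psi = np.dot(x, x)                         # k = 0 term of Psi
    for lag in range(period, n, period):       # k = +/-1, +/-2, ...
        psi += 2.0 * np.dot(x[lag:], x[:n - lag])
    psi *= period
    bias = 1.0 - period * np.sum(w ** 4)       # bias factor from Eq. (3.44)
    return (energy - psi) / (bias * energy)

def demo(period=40, length=256, trials=300, seed=1):
    """Return (error for a periodic segment, mean error for white noise)."""
    rng = np.random.default_rng(seed)
    w = np.sqrt(np.bartlett(length))
    w /= np.sqrt(np.sum(w ** 2))               # normalize so that sum(w**2) == 1
    one_period = rng.standard_normal(period)
    periodic = np.tile(one_period, length // period + 1)[:length]
    noise_scores = [unbiased_error(rng.standard_normal(length), w, period)
                    for _ in range(trials)]
    return unbiased_error(periodic, w, period), float(np.mean(noise_scores))
```

The first value returned by `demo()` should be near zero and the second near
one. Scaling the input signal leaves `unbiased_error` unchanged, which reflects
the independence from the noise variance noted above.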


3.6  Required  Pitch  Period  Accuracy 


In  Section  3.3.2  we  described  a  method  for  estimating  the  voiced/unvoiced 
decisions  for  each  harmonic  by  comparing  the  normalized  error  over  each 
harmonic  of  the  estimated  fundamental  to  a  threshold.  The  normalized 
error  for  each  harmonic  contains  contributions  due  to  the  difference  between 
the  estimated  harmonic  frequency  and  the  actual  harmonic  frequency  as 
well  as  the  contribution  due  to  noise  in  the  original  signal.  In  this  section, 
the  required  pitch  period  accuracy  to  prevent  differences  in  the  estimated 
and  actual  harmonic  frequencies  from  dominating  the  normalized  error  is 
determined. 

The normalized error between a harmonic of a perfectly periodic signal
at normalized frequency $f$ and a synthetic harmonic at estimated normalized
frequency $\hat{f}$ depends on the difference $\Delta f$ between the two
frequencies. When the frequency difference $\Delta f$ is near zero, the
normalized error of Equation (3.11) is near zero. When the frequency
difference $\Delta f$ is large, the normalized error approaches one.
Normalized error versus frequency difference is shown in Figure 3.5 for a
256 point square root triangular window. Figure 3.6 shows an expanded version
of Figure 3.5 for small frequency differences. By listening to the
synthesized speech, a good threshold for the Voiced/Unvoiced decision was
determined to be approximately .2. Consequently, to prevent the normalized
error from being dominated by an inaccurate pitch period estimate, by
referring to Figure 3.6 we find that the maximum harmonic frequency
difference should be smaller than about .001. The pitch period accuracy
required to achieve a maximum harmonic frequency difference of .001 is shown
in Figure 3.7.

Figure 3.5: Normalized Error Versus Normalized Frequency Difference

The number of harmonics M of a normalized fundamental frequency $f_0$
between normalized frequencies of zero and .5 is:

$$M = \left\lfloor \frac{.5}{f_0} \right\rfloor \tag{3.46}$$

So, the frequency deviation of the highest harmonic for an estimated
fundamental of $\hat{f}_0$ and an actual fundamental of $f_0$ is:

$$\Delta f = \left\lfloor \frac{.5}{\hat{f}_0} \right\rfloor (\hat{f}_0 - f_0) \tag{3.47}$$

In terms of pitch periods, Equation (3.47) becomes:

$$\Delta f \approx \frac{\Delta P}{2P} \tag{3.48}$$

where $\Delta P$ is the difference between the actual and estimated pitch periods
and the approximation comes from ignoring the floor function in Equation
(3.47).

Figure 3.8 shows the smallest maximum harmonic frequency deviation
attainable ($\Delta P = .5$) for a pitch detector which produces integer pitch
period estimates. This figure clearly shows that the maximum harmonic
frequency deviation significantly exceeds our desired value of .001 if only
integer pitch periods are used. In addition, shorter pitch periods have
significantly larger maximum harmonic frequency deviations than longer
pitch periods.

Figure 3.8: Smallest Maximum Harmonic Frequency Deviation for Integer
Pitch Periods
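The relationship between pitch period error and harmonic frequency deviation
can be evaluated directly. The Python sketch below is illustrative only: it
takes the normalized fundamental as the reciprocal of the pitch period in
samples, uses the magnitude of the deviation, and assumes the approximate form
$\Delta f \approx \Delta P / (2P)$ for Equation (3.48). The function names are
not part of the original system.

```python
import math

def num_harmonics(f0):
    """Eq. (3.46): harmonics of f0 between normalized frequencies 0 and .5."""
    return math.floor(0.5 / f0)

def highest_harmonic_deviation(actual_period, estimated_period):
    """Magnitude of Eq. (3.47) with f0 = 1/P."""
    f0 = 1.0 / actual_period
    f0_hat = 1.0 / estimated_period
    return abs(num_harmonics(f0_hat) * (f0_hat - f0))

def approx_deviation(actual_period, delta_p):
    """Eq. (3.48): the floor function of Eq. (3.47) is ignored."""
    return abs(delta_p) / (2.0 * actual_period)

def required_period_accuracy(period, max_deviation=0.001):
    """Pitch period error that keeps the approximate deviation at max_deviation."""
    return 2.0 * period * max_deviation

if __name__ == "__main__":
    for period in (20, 60, 120):
        print(period,
              approx_deviation(period, 0.5),
              required_period_accuracy(period))
```

For a half-sample error the approximate deviation is .0125 at a pitch period
of 20 samples and about .0021 at 120 samples, both above the desired value of
.001, which agrees with the observation that integer pitch periods are not
accurate enough.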

In order to determine the accuracy of the autocorrelation domain method
described in Section 3.4 and the frequency domain method described in
Section 3.3.1, an experiment was conducted in which these techniques were
used to estimate the pitch period of 6000 different synthesized periodic
segments. The experiment consisted of generating 100 periodic segments
for each of 60 different 2 sample intervals with center periods of 20 to 120
samples. The pitch periods of the segments were uniformly distributed in
the 2 sample interval. The phases of the harmonics were random with a
uniform distribution between $-\pi$ and $\pi$. The magnitudes of the harmonics
decreased linearly to zero at a frequency of half the sampling rate.

The maximum deviation and standard deviation of the pitch period
estimates are shown in Figure 3.9 and Figure 3.10 for the autocorrelation
domain and frequency domain methods. The corresponding maximum
deviation and standard deviation of the frequency of the highest harmonic
(in the normalized frequency range of 0 to .5) of the estimated fundamental
are shown in Figure 3.11 and Figure 3.12 for the autocorrelation domain
and frequency domain methods. These figures show that for this test,
the frequency domain method provides pitch period estimates that are
approximately 10 times more accurate than the autocorrelation method.

Figure 3.9: Pitch Period Deviation for Autocorrelation Domain Method

Figure 3.12: Frequency Deviation of Highest Harmonic for Frequency Domain
Method

From Figure 3.11, it can be seen that the maximum harmonic frequency
deviation for the autocorrelation method of approximately .003 is larger
than our desired value of .001. The frequency domain method is capable
of more than sufficient accuracy with a maximum harmonic frequency
deviation near .0002. However, the autocorrelation method is significantly
more efficient computationally due to the possibility of FFT implementation.
Consequently, we use the computationally efficient autocorrelation
domain method to obtain an initial pitch period estimate followed by the
more accurate frequency domain method to refine the initial estimate.


3.7  Analysis  Algorithm


The analysis algorithm that we use in practice consists of the following
steps (see Figure 3.13):

1. Window a speech segment with the analysis window.

2. Compute the unbiased error criterion of Equation (3.45) vs. pitch
period using the efficient autocorrelation domain approach described
in Section 3.4. This error is typically computed for all integer pitch
periods from 20 to 120 samples for a 10 kHz sampling rate.

3. Use the dynamic programming approach described in Section 3.3.1
to select the initial pitch period estimate. This pitch tracking technique
improves tracking through very low signal to noise ratio (SNR)
segments while not decreasing the accuracy in high SNR segments.

4. Refine this initial pitch period estimate using the more accurate
frequency domain pitch period estimation method described in
Section 3.3.1.

5. Estimate the voiced and unvoiced spectral envelope parameters using
the techniques described in Section 3.3.1.

6. Make a voiced/unvoiced decision for each frequency band in the
spectrum. The number of frequency bands in the spectrum can be as large
as the number of harmonics of the fundamental present in the spectrum.

7. The final spectral envelope parameter representation is composed by
combining voiced spectral envelope parameters in those frequency
bands declared voiced with unvoiced spectral envelope parameters in
those frequency bands declared unvoiced.

Figure 3.13: Analysis Algorithm Flowchart


Chapter  4 


Speech  Synthesis 


4.1  Introduction 

In the previous two chapters, the Multi-Band Excitation Model parameters
were described and methods to estimate these parameters were developed.
In this chapter, an approach to synthesizing speech from the model
parameters is presented. There exist a number of methods for synthesizing
speech from the spectral envelope and excitation parameters. The following
section discusses several applicable methods and selects one for generating
the voiced portion of the synthesized speech and a second for generating
the unvoiced portion of the synthesized speech. The details of our speech
synthesis algorithm are then presented in Section 4.3.


4.2  Background 

Speech can be synthesized from the estimated model parameters using
several different approaches. One approach is to generate a sequence of
synthetic spectral magnitudes from the estimated model parameters. Then,
algorithms for estimating a signal from this synthetic Short-Time Fourier
Transform Magnitude (STFTM) are applied. In a second approach, a
synthetic Short-Time Fourier Transform (STFT) is generated. Then,
algorithms for estimating a signal from this synthetic STFT are applied. In a
third approach, the synthetic speech signal is generated in the time domain
from the speech model parameters.

A synthetic STFTM can be constructed from the Multi-Band Excitation
model parameters by combining segments of a periodic spectrum
in regions declared voiced with segments of a noise spectrum in regions
declared unvoiced to generate the excitation spectrum. The noise spectrum
segments are normalized to have an average magnitude per sample
of unity. A densely sampled spectral envelope can be obtained by
interpolating between the samples ($|A_m|$) of the spectral envelope. We have
used a constant value set to $|A_m|$ in voiced regions and linear
interpolation between adjacent samples ($|A_m|$) in unvoiced regions. The
excitation spectrum is then multiplied by the densely sampled spectral
envelope to generate the synthetic STFTM. Nawab has shown [23] that a signal
can be exactly reconstructed from its STFTM under certain conditions. However,
this algorithm requires the STFTM to be a valid STFTM (the STFTM
of some signal). Due to the modeling and synthesis process, the synthetic
STFTM is not guaranteed to be a valid STFTM. Consequently this
algorithm cannot be successfully applied to this problem. Another algorithm,
developed by Griffin and Lim [8] for estimating a signal from a modified
STFTM has been successfully applied to this problem for the applications
of analysis/synthesis and time-scale modification for both clean and noisy
speech [9]. However, this algorithm is quite expensive computationally and
requires a processing delay of approximately one second. This processing
delay is unacceptable in most real-time speech bandwidth compression
applications.

A synthetic STFT can be constructed from the Multi-Band Excitation
model parameters by combining segments of a periodic transform in
regions declared voiced with segments of a noise transform in regions declared
unvoiced. The noise transform segments are normalized as in the previous
paragraph and a densely sampled spectral envelope is generated. The phase
of the samples in voiced regions is set to the phase of the spectral envelope
samples $A_m$. The weighted overlap-add algorithm [8] can then be used to
estimate a signal with STFT closest to this synthetic STFT in the
least-squares sense. One problem with this approach is that the voiced portion
of the synthesized signal is modeled as a periodic signal with constant
fundamental over the entire frame. When small window shifts are used in the
analysis/synthesis system, a fairly continuous fundamental frequency
variation is allowed as observed in the STFTM of the original speech. However,
when large window shifts are used (as is necessary to reduce the bit-rate
for speech coding applications) the large potential change in fundamental
frequency from one frame to the next causes time discontinuities in the
harmonics of the fundamental in the STFTM.

A third approach to synthesizing speech involves synthesizing the voiced
and unvoiced portions in the time domain and then adding them together.
The voiced signal can be synthesized as the sum of sinusoidal oscillators
with frequencies at the harmonics of the fundamental and amplitudes set
by the spectral envelope parameters. This technique has the advantage of
allowing a continuous variation in fundamental frequency from one frame
to the next, eliminating the problem of time discontinuities in the harmonics
of the fundamental in the STFTM. The unvoiced signal can be synthesized
as the sum of bandpass filtered white noise.

4.3  Speech  Synthesis  Algorithm 

A  time  domain  method  was  selected  for  synthesizing  the  voiced  portion  of 
the  synthetic  speech.  This  method  was  selected  due  to  its  advantage  of 
allowing  a  continuous  variation  in  fundamental  frequency  from  frame  to 
frame.  A  frequency  domain  (STFT)  method  was  selected  for  synthesizing 
the  unvoiced  portion  of  the  synthetic  speech.  This  method  was  selected  due 
to  the  ease  and  efficiency  of  implementation  of  a  filter  bank  in  the  frequency 
domain  with  the  Fast  Fourier  Transform  (FFT)  algorithm.  Speech  is  then 
synthesized  as  the  sum  of  the  synthetic  voiced  signal  and  the  synthetic 
unvoiced  signal. 

As discussed in the previous section, voiced speech can be synthesized
in the time domain as the sum of sinusoidal oscillators:

$$s_v(t) = \sum_m A_m(t) \cos(\theta_m(t)) \tag{4.1}$$

The amplitude function $A_m(t)$ is linearly interpolated between frames with
the amplitudes of harmonics marked unvoiced set to zero. The phase
function $\theta_m(t)$ is determined by an initial phase $\phi_0$ and a frequency
track $\omega_m(t)$:

$$\theta_m(t) = \int_0^t \omega_m(\xi)\,d\xi + \phi_0 \tag{4.2}$$

The frequency track $\omega_m(t)$ is linearly interpolated between the mth
harmonic of the current frame and that of the next frame as follows:

$$\omega_m(t) = m\,\omega_0(0)\,\frac{S - t}{S} + m\,\omega_0(S)\,\frac{t}{S} + \Delta\omega_m \tag{4.3}$$

where $\omega_0(0)$ and $\omega_0(S)$ are the fundamental frequencies at t = 0 and t = S
respectively and S is the window shift. The initial phase $\phi_0$ and frequency
deviation $\Delta\omega_m$ parameters are chosen so that the principal values of
$\theta_m(0)$ and $\theta_m(S)$ are equal to the measured harmonic phases in the
current and next frame. When the mth harmonics of the current and next frames
are both declared voiced, the initial phase $\phi_0$ is set to the measured phase
of the current frame and $\Delta\omega_m$ is chosen to be the smallest frequency
deviation required to match the phase of the next frame. When either of the
harmonics is declared unvoiced, only the initial phase parameter $\phi_0$ is
required to match the phase function $\theta_m(t)$ with the phase of the voiced
harmonic ($\Delta\omega_m$ is set to zero). When both harmonics are declared
unvoiced, the amplitude function $A_m(t)$ is zero over the entire interval
between frames so any phase function will suffice.
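The phase matching described above can be sketched in Python for the case in
which a harmonic is voiced in both frames. This is a minimal sketch, not the
synthesizer itself: it works in discrete time with frequencies in radians per
sample, integrates Equation (4.3) in closed form, and uses illustrative
function and argument names.

```python
import math
import numpy as np

def wrap_phase(phi):
    """Principal value of a phase in [-pi, pi)."""
    return (phi + math.pi) % (2.0 * math.pi) - math.pi

def harmonic_phase_track(m, w0_start, w0_end, phi_start, phi_end, shift):
    """Phase theta_m(t) for t = 0..shift, harmonic voiced in both frames.

    w0_start, w0_end: fundamental frequencies (radians/sample) at t = 0, t = S.
    phi_start, phi_end: measured phases of the m-th harmonic in the two frames.
    Returns the phase track and the frequency deviation delta_w.
    """
    t = np.arange(shift + 1, dtype=float)
    # Integral of the interpolated frequency track of Eq. (4.3), delta_w excluded.
    base = m * (w0_start * (t - t ** 2 / (2.0 * shift))
                + w0_end * t ** 2 / (2.0 * shift))
    # Smallest deviation that makes theta_m(S) equal phi_end modulo 2*pi.
    delta_w = wrap_phase(phi_end - (phi_start + base[-1])) / shift
    return phi_start + base + delta_w * t, delta_w

def voiced_segment(amps_start, amps_end, w0_start, w0_end,
                   phis_start, phis_end, shift):
    """Eq. (4.1) over one frame interval with linearly interpolated amplitudes."""
    t = np.arange(shift + 1, dtype=float)
    out = np.zeros(shift + 1)
    for i, (a0, a1, p0, p1) in enumerate(
            zip(amps_start, amps_end, phis_start, phis_end)):
        theta, _ = harmonic_phase_track(i + 1, w0_start, w0_end, p0, p1, shift)
        out += (a0 + (a1 - a0) * t / shift) * np.cos(theta)
    return out
```

A harmonic that is unvoiced in one of the two frames can be handled in this
sketch by passing an amplitude of zero on that side, which matches the linear
amplitude interpolation described above.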

Large differences in fundamental frequency can occur between adjacent
frames due to word boundaries and other effects. In these cases, linear
interpolation of the fundamental frequency between frames is a poor model
of fundamental frequency variation and can lead to artifacts in the
synthesized signal. Consequently, when fundamental frequency changes of more
than 10 percent are encountered from frame to frame, the voiced harmonics
of the current frame and the next frame are treated as if followed and
preceded respectively by unvoiced harmonics.

The unvoiced speech has been generated by taking the STFT of a white
noise sequence and zeroing out the frequency regions marked voiced. The
samples in the unvoiced regions are then normalized to have the desired
average magnitude specified by the spectral envelope parameters. The
synthetic unvoiced speech can then be produced from this synthetic STFT
using the weighted overlap-add method. It should be noted that this
algorithm can synthesize the unvoiced portion of the synthetic speech signal on
a frame by frame basis for real-time synthesis.
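The per-frame shaping of the noise transform can be sketched as follows. This
Python fragment is an illustration under simplifying assumptions: each band is
scaled by a single gain so that its average magnitude equals the desired
value, the band edges are given as FFT bin indices, and the weighted
overlap-add step that follows is omitted. The function and argument names are
illustrative.

```python
import numpy as np

def shape_noise_frame(noise_frame, window, band_edges, band_voiced,
                      band_magnitudes):
    """Return a synthetic unvoiced STFT frame (one-sided FFT bins).

    band_edges: bin indices delimiting the bands, length = number of bands + 1.
    band_voiced: True where the band was declared voiced.
    band_magnitudes: desired average magnitude for each band.
    """
    spectrum = np.fft.rfft(window * noise_frame)
    for lo, hi, voiced, target in zip(band_edges[:-1], band_edges[1:],
                                      band_voiced, band_magnitudes):
        if voiced:
            spectrum[lo:hi] = 0.0              # voiced regions carry no noise
        else:
            mean_mag = np.mean(np.abs(spectrum[lo:hi]))
            if mean_mag > 0.0:
                spectrum[lo:hi] *= target / mean_mag
    return spectrum
```

Because each frame is processed independently of later frames, a routine of
this kind fits the frame by frame operation mentioned above.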


4.4  Speech  Synthesis  System 

A block diagram of our current speech synthesis system is shown in
Figures 4.1 through 4.4. First, the spectral envelope samples are separated into
voiced or unvoiced spectral envelope samples depending on whether they
are in frequency bands declared voiced or unvoiced (Figure 4.1). Voiced
envelope samples in frequency bands declared unvoiced are set to zero as
are unvoiced envelope samples in frequency bands declared voiced. Voiced
envelope samples include both magnitude and phase whereas unvoiced
envelope samples include only the magnitude.

Figure 4.1: Separation of Envelope Samples


Voiced speech is synthesized from the voiced envelope samples by
summing the outputs of a bank of sinusoidal oscillators running at the
harmonics of the fundamental frequency (Figure 4.2). The amplitudes of the
oscillators are set to the magnitudes of the envelope samples with linear
interpolation between frames. The phase tracks of the oscillators are adjusted
to match the phases of the envelope samples.

Figure 4.2: Voiced Speech Synthesis

Unvoiced speech is synthesized from the unvoiced envelope samples by
first synthesizing a white noise sequence. For each frame, the white noise
sequence is windowed and an FFT is applied to produce samples of the
Fourier transform (Figure 4.3). A sample of the spectral envelope is
estimated in each frequency band by averaging together the magnitude of the
FFT samples in that band. This spectral envelope is then replaced by the
unvoiced spectral envelope generated from the unvoiced envelope samples.
This unvoiced spectral envelope is obtained by linear interpolation between
the unvoiced envelope samples. These synthetic transforms are then used
to synthesize unvoiced speech using the weighted overlap-add method.

Figure 4.3: Unvoiced Speech Synthesis

The final synthesized speech is generated by summing the voiced and
unvoiced synthesized speech signals (Figure 4.4).

Figure 4.4: Speech Synthesis


Chapter  5


Application to the Development of a High Quality 8 kbps Speech
Coding System


5.1  Introduction 


Among many applications of our new model, we considered the problem
of bit-rate reduction for speech transmission and storage. In a number of
speech coding applications, it is important to reproduce the original clean
or noisy speech as closely as possible. For example, in mobile telephone
applications, users would like to be able to identify the person on the other
end of the phone and are usually annoyed at any artificial sounding
degradations. These degradations are particularly severe for most vocoders when
operating in noisy environments such as a moving car. Consequently, for
these applications, we are interested in both the quality and intelligibility of
the reproduced speech. In other applications, such as a fighter cockpit, the
message is of primary importance. For these applications, we are interested
mainly in the intelligibility of the reproduced speech.

To demonstrate the performance of the Multi-Band Excitation Speech
Analysis/Synthesis System for this problem, an 8 kbps speech coding
system was developed. Since our primary goal is to demonstrate the high
performance of the Multi-Band Excitation Model and the corresponding
speech analysis methods, fairly conventional and simple parameter coding
methods have been used to facilitate comparison with other systems. Even
though simple coding methods have been used, the results are quite good.

The major innovation in the Multi-Band Excitation Speech Model is the
ability to declare a large number of frequency regions as containing periodic
or aperiodic energy. To determine the advantage of this new model, the
Multi-Band Excitation Speech Coder operating at 8 kbps was compared
to a system using a single V/UV bit per frame (Single Band Excitation
Vocoder). The Single Band Excitation (SBE) Coder employs exactly the
same parameters as the Multi-Band Excitation Speech Coder except that
one V/UV bit per frame is used instead of 12. Although this results in a
somewhat smaller bit-rate for the more conventional coding system (7.45
kbps), we wished to maintain the same coding rates for the other parameters
in order to focus the comparison on the usefulness of the V/UV information
rather than particular modeling or coding methods for the other
parameters. In addition, this avoids the problem of trying to optimally assign
these 11 bits to coding the other parameters and the subsequent multitudes of
DRT tests to evaluate all possible combinations.

5.2  Coding  of  Speech  Model  Parameters 

A 25.6 ms Hamming window was used to segment 4 kHz bandwidth speech
sampled at 10 kHz. The estimated speech model parameters were coded
at 8 kbps using a 50 Hz frame rate. This allows 160 bits per frame for
coding of the harmonic magnitudes and phases, fundamental frequency, and
voiced/unvoiced information. The number of bits allocated to each of these
parameters per frame is displayed in Table 5.1.

Parameter                 Bits
Harmonic Magnitudes       139-94
Harmonic Phases           0-45
Fundamental Frequency     9
Voiced/Unvoiced Bits      12
Total                     160

Table 5.1: Bit Allocation per Frame

As discussed in Chapter 4, phase is not required for harmonics declared
unvoiced. Consequently, bits assigned to phases declared unvoiced are
reassigned to the magnitude. So, when all harmonics are declared voiced, 45
bits are assigned for phase coding and 94 bits are assigned for magnitude
coding. At the other extreme, when all harmonics are declared unvoiced, no
bits are assigned to phase and 139 bits are assigned for magnitude coding.
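The bit budget can be checked with a few lines of arithmetic. The Python
sketch below only restates the numbers given in the text and in Table 5.1; the
constant and function names are illustrative.

```python
BIT_RATE = 8000        # bits per second
FRAME_RATE = 50        # frames per second
PITCH_BITS = 9         # fundamental frequency
VUV_BITS_MBE = 12      # voiced/unvoiced bits in the MBE coder
MAX_PHASE_BITS = 45    # phase bits when all harmonics are voiced

def magnitude_bits(phase_bits, vuv_bits=VUV_BITS_MBE):
    """Bits left for harmonic magnitudes once the other parameters are coded."""
    if not 0 <= phase_bits <= MAX_PHASE_BITS:
        raise ValueError("phase_bits must lie between 0 and 45")
    bits_per_frame = BIT_RATE // FRAME_RATE
    return bits_per_frame - PITCH_BITS - vuv_bits - phase_bits

def sbe_bit_rate():
    """Rate of the SBE coder: same allocation, 1 V/UV bit instead of 12."""
    bits_per_frame = BIT_RATE // FRAME_RATE - (VUV_BITS_MBE - 1)
    return bits_per_frame * FRAME_RATE
```

With these numbers, `magnitude_bits(45)` is 94, `magnitude_bits(0)` is 139, and
`sbe_bit_rate()` is 7450 bits per second, which is the 7.45 kbps quoted for the
Single Band Excitation coder.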
5.2.1  Coding  of  Harmonic  Magnitudes


The harmonic magnitudes are coded using the same techniques employed by
channel vocoders [11]. In this method, the logarithms of the harmonic
magnitudes are encoded using adaptive differential PCM across frequency. The
log-magnitude of the first harmonic is coded using 5 bits with a
quantization step size of 2 dB. The number of bits assigned to coding the
difference between the log-magnitude of the mth harmonic and the coded value of
the previous harmonic (within the same frame) is determined by summing
samples of the bit density curve of Figure 5.1 over the frequency interval
occupied by the mth harmonic. The available bits for coding the
magnitude are then assigned to each harmonic in proportion to these sums. For
example, Figure 5.2 shows the number of bits assigned to code each
harmonic of a coded fundamental frequency of .01 (normalized frequency). The
coded value of the fundamental is used so that the number of bits allocated
to each harmonic can be determined at the receiver from the transmitted
coded fundamental frequency. The number of bits assigned to each
harmonic in Figure 5.2 is, in general, non-integer. For a non-integer number
of bits, the integer part is taken and the fractional part is added to the bits
assigned to the next harmonic. The quantization step size depends on the
number of bits assigned and is listed in Table 5.2.

[Figure: horizontal axis labeled Frequency (kHz)]


Bits    Step Size (dB)    Min (dB)    Max (dB)
1       8                 -4          4
2       6.5               -9.75       9.75
3       5                 -17.5       17.5
4       3                 -22.5       22.5
5       2                 -31         31
6       1                 -31.5       31.5
7       0.5               -31.75      31.75
8       0.25              -31.875     31.875

Table 5.2: Quantization Step Sizes
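The fractional bit carry and the step sizes of Table 5.2 can be combined in a
short Python sketch. It is a sketch under assumptions: the per-harmonic
weights stand in for the sums of the bit density curve of Figure 5.1, which is
not reproduced here, and the quantizer is taken to have $2^b$ uniformly spaced
output levels centered on zero, which is what the Min and Max columns of the
table imply. All names are illustrative.

```python
import math

# Table 5.2: number of bits -> quantization step size in dB.
STEP_SIZE_DB = {1: 8.0, 2: 6.5, 3: 5.0, 4: 3.0, 5: 2.0, 6: 1.0, 7: 0.5, 8: 0.25}

def allocate_bits(weights, total_bits):
    """Split total_bits in proportion to weights, carrying fractions forward."""
    scale = total_bits / sum(weights)
    bits, carry = [], 0.0
    for weight in weights:
        share = weight * scale + carry
        whole = math.floor(share + 1e-9)       # guard against rounding error
        bits.append(whole)
        carry = share - whole                  # fraction goes to the next harmonic
    return bits

def quantize_difference(diff_db, bits):
    """Quantize a log-magnitude difference; returns (index, decoded value in dB)."""
    step = STEP_SIZE_DB[bits]
    levels = 2 ** bits
    index = math.floor(diff_db / step) + levels // 2
    index = min(max(index, 0), levels - 1)     # clip to the coded range
    return index, (index - levels / 2 + 0.5) * step
```

For example, `allocate_bits([1, 1, 1], 10)` gives 3, 3 and 4 bits: the
fractions left over from the first two harmonics accumulate and are spent on
the third.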


5.2.2  Coding  of  Harmonic  Phases 

When generating the STFT phase, the primary consideration in high
quality synthesis is to generate the STFT phase so that the phase difference
from frame to frame is consistent with the fundamental frequency in voiced
regions. Obtaining the correct relative phase between harmonics is of
secondary importance for high quality synthesis. However, results of informal
listening indicate that incorrect relative phase between harmonics can cause
a variety of perceptual differences between the original and synthesized
speech especially at low frequencies. Consequently, the phases of
harmonics declared voiced are encoded by predicting the phase of the current frame
from the phase of the previous frame using the average fundamental
frequency for the two frames. Then, the difference between the predicted and
estimated phase for the current frame is coded starting with the phases of
the low frequency harmonics. The difference between the predicted and
estimated phase is set to zero for any uncoded voiced harmonics to maintain a
frame to frame phase difference consistent with the fundamental frequency.
An example of phase coding is shown in Figures 5.3 through 5.6 for a frame
of speech in which all frequency bands were declared voiced. The phases of
harmonics in frequency regions declared unvoiced do not need to be coded
since they are not required by the speech synthesizer.

The difference between the predicted and estimated phase can be coded
using uniform quantization to code the first N harmonics between $-\pi$ and
$\pi$. For the 8 kbps system, the phases of the first 12 harmonics (starting
at low frequency) were coded using approximately 13 levels per harmonic.
This coding method is simple and produces fairly good results. However, it
fails to take advantage of the expected concentration of the phase differences
around zero for consecutive voiced harmonics.

Figure 5.3: Estimated Harmonic Phases

Figure 5.5: Difference Between Estimated and Predicted Phases

Figure 5.6: Coded Phase Differences
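The prediction and uniform quantization of the phase differences can be
sketched in Python. The sketch makes several assumptions that are not stated in
the text: frequencies are in radians per sample, the frame shift is given in
samples, the 13 levels are the centers of 13 equal cells covering the interval
from $-\pi$ to $\pi$, and the function names are illustrative.

```python
import math

def wrap_phase(phi):
    """Principal value of a phase in [-pi, pi)."""
    return (phi + math.pi) % (2.0 * math.pi) - math.pi

def predict_phase(prev_phase, m, w0_prev, w0_cur, shift):
    """Advance the m-th harmonic by the average fundamental over one frame."""
    return prev_phase + m * 0.5 * (w0_prev + w0_cur) * shift

def encode_phase(measured, predicted, levels=13):
    """Index of the uniform cell containing the wrapped prediction error."""
    step = 2.0 * math.pi / levels
    diff = wrap_phase(measured - predicted)
    return min(int((diff + math.pi) / step), levels - 1)

def decode_phase(index, predicted, levels=13):
    """Predicted phase plus the center of the coded cell."""
    step = 2.0 * math.pi / levels
    return wrap_phase(predicted - math.pi + (index + 0.5) * step)
```

With an odd number of levels the middle cell is centered on zero, so a
harmonic whose measured phase equals the prediction is decoded without error.
Spending 45 bits on 12 harmonics corresponds to 3.75 bits each, or roughly 13
levels per harmonic.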

To show the distribution of phase differences for several frequency bands,
six speech sentences were processed and the composite histograms
generated. The phase differences accumulated were the difference between the
predicted and estimated phase of the harmonics that were declared voiced
in consecutive frames. As indicated in Figures 5.7 through 5.9, the phase
differences tend to be concentrated around zero especially for low
frequencies. For higher frequencies, the distribution tends to become more uniform
as the phases of the higher frequency harmonics become less predictable.

Figure 5.0: Phase Difference Histogram (1.0 - 1.5 kHz)

Several methods are available for reducing the average number of bits
required to code a parameter at a given average quantization error. In
entropy coding [31], the parameter is uniformly quantized with L
quantization levels and a symbol y_i is assigned to the i-th quantization
level. The minimum average achievable rate to code these symbols is given
by the entropy:

    H = -\sum_{i=1}^{L} P(y_i) \log_2 P(y_i)        (5.1)

In entropy coding, the number of bits assigned to the symbol y_i is:

    B_i \approx -\log_2 P(y_i)                      (5.2)

so that shorter code words are used for more probable symbols. The
approximation occurs in Equation (5.2) since -\log_2 P(y_i) may not be an
integer value. The resulting variable length code achieves an average rate
close to the entropy. Constructive methods exist [13] for generating
optimum variable length codes. The problem with entropy coding is that if
a number of improbable events occur closely spaced in time, a large delay
is required to transmit the code words. This can result in unacceptably
long pauses at the synthesis end of a speech coding system, in addition to
requiring a large data buffer.
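As a small numerical illustration of Equations (5.1) and (5.2), the sketch below estimates symbol probabilities from a stream of quantizer indices and evaluates the entropy and the ideal (possibly non-integer) code word lengths. The function name and the example index stream are hypothetical and are not taken from the report.

```python
import math
from collections import Counter

def entropy_and_code_lengths(symbols):
    """Return (H, lengths): entropy in bits/symbol and ideal bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    probs = {s: c / total for s, c in counts.items()}
    # Equation (5.1): H = -sum_i P(y_i) log2 P(y_i)
    entropy = -sum(p * math.log2(p) for p in probs.values())
    # Equation (5.2): B_i ~ -log2 P(y_i); need not be an integer
    lengths = {s: -math.log2(p) for s, p in probs.items()}
    return entropy, lengths

# Hypothetical stream of quantizer indices concentrated on level 0,
# loosely mimicking phase differences clustered around zero.
indices = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
H, lengths = entropy_and_code_lengths(indices)
print(H, lengths)
```

For this made-up stream the entropy is 1.75 bits per symbol, compared with 2 bits per symbol for a fixed length code over the same four levels.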

In Lloyd-Max quantization [16], [20], nonuniform quantization is used
to minimize the average quantization error for a given number of
quantization levels. An equal number of bits is then used to code each
level. This coding method has the advantage of having fixed length code
words. However, parameter values with low probability are often coded with
a large quantization error.

An L level Lloyd-Max quantizer is specified by the end points x_i of each
of the L input ranges and an output level y_i corresponding to each input
range. We then define a distortion function

    D = \sum_{i=1}^{L} \int_{x_{i-1}}^{x_i} f(x - y_i) p(x) dx        (5.3)

where f(x) is some function (we used f(x) = x^2) and p(x) is the input
amplitude probability density. The objective is to choose the x_i's and the
corresponding y_i's to minimize this distortion function. Several iterative
methods exist [16], [20] for minimizing this distortion function.
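One such iterative method can be sketched as follows. This is a minimal sample-based version of the Lloyd iteration for f(x) = x^2, not the report's implementation. It alternates between placing the end points midway between adjacent output levels and moving each output level to the mean of the samples in its range. The function names, the quantile initialization, the iteration count and the synthetic Gaussian "phase difference" data are all assumptions made for illustration.

```python
import math
import random

def lloyd_max(samples, levels, iterations=50):
    """Return (edges, outputs) of a quantizer fitted to the samples."""
    xs = sorted(samples)
    n = len(xs)
    # Assumed initialization: output levels at evenly spaced sample quantiles.
    outputs = [xs[int((i + 0.5) * n / levels)] for i in range(levels)]
    for _ in range(iterations):
        # End points: midway between neighbouring output levels.
        edges = [(a + b) / 2 for a, b in zip(outputs, outputs[1:])]
        sums, counts = [0.0] * levels, [0] * levels
        k = 0
        for x in xs:  # xs is sorted, so the range index only moves upward
            while k < levels - 1 and x > edges[k]:
                k += 1
            sums[k] += x
            counts[k] += 1
        # Output levels: mean of the samples in each range (kept if empty).
        outputs = [s / c if c else y for s, c, y in zip(sums, counts, outputs)]
    edges = [(a + b) / 2 for a, b in zip(outputs, outputs[1:])]
    return edges, outputs

def uniform_quantizer(levels, lo, hi):
    """Return (edges, outputs) of a uniform quantizer on [lo, hi]."""
    step = (hi - lo) / levels
    edges = [lo + step * (i + 1) for i in range(levels - 1)]
    outputs = [lo + step * (i + 0.5) for i in range(levels)]
    return edges, outputs

def mean_squared_error(samples, edges, outputs):
    """Sample estimate of the distortion D with f(x) = x^2."""
    total = 0.0
    for x in samples:
        k = sum(1 for e in edges if x > e)  # index of the range containing x
        total += (x - outputs[k]) ** 2
    return total / len(samples)

# Synthetic data: values concentrated around zero, clipped to [-pi, pi].
rng = random.Random(0)
samples = [max(-math.pi, min(math.pi, rng.gauss(0.0, 0.5))) for _ in range(5000)]
lm = lloyd_max(samples, 13)
un = uniform_quantizer(13, -math.pi, math.pi)
gain_db = 10 * math.log10(mean_squared_error(samples, *un) /
                          mean_squared_error(samples, *lm))
print(round(gain_db, 1))
```

The printed value is the error reduction in dB of the fitted quantizer over the uniform one for this synthetic data only; it is not comparable to the measured values in Table 5.3.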
Table 5.3 shows the reduction in quantization error in dB for a 13-level
Lloyd-Max quantizer over a 13-level uniform quantizer. As expected,
significantly more improvement is obtained for the more predictable lower
frequencies.

    Freq (kHz)    Improvement (dB)
        ?              4.4
        ?              3.2
        ?              1.7
        ?              1.6
        ?              0.05

Table 5.3: Quantization Error Reduction
(The Freq (kHz) entries are missing from the source.)

Due to the improved performance of the Lloyd-Max quantizer over a
uniform quantizer and the advantage of fixed length code words over
entropy coding, the Lloyd-Max quantizer was employed in the 8 kbps MBE
Coder.



5.2.3 Coding of V/UV Information

The voiced/unvoiced information can be encoded using a variety of methods.
We have observed that voiced/unvoiced decisions tend to cluster in
both frequency and time due to the slowly varying nature of speech in the
STFTM domain. Run-length coding can be used to take advantage of this
expected clustering of voiced/unvoiced decisions. However, run-length
coding requires a variable number of bits to exactly encode a fixed number
of samples. This makes implementation of a fixed rate coder more difficult.

A  simple  approach  to  coding  the  voiced/unvoiced  information  with  a 
fixed  number  of  bits  while  providing  good  performance  was  developed.  In 
this  approach,  if  N  bits  are  available,  the  spectrum  is  divided  into  N  equal 
frequency  bands  and  a  voiced/unvoiced  bit  is  used  for  each  band.  The 
voiced/unvoiced  bit  is  set  by  comparing  a  weighted  sum  of  the  normalized 
errors  of  all  of  the  harmonics  in  a  particular  frequency  band  to  a  threshold. 
When  the  weighted  sum  is  less  than  the  threshold,  the  frequency  band  is 
set  to  voiced.  When  the  weighted  sum  is  greater  than  the  threshold,  the 
frequency  band  is  set  to  unvoiced.  The  sum  is  weighted  by  the  estimated 
harmonic magnitudes as follows:

    [equation illegible in the source]

where m is summed over all of the harmonics in the kth frequency band.
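The band decision described above can be sketched as follows. Because the equation itself is not legible in the source, the exact weighting is an assumption: here the normalized errors are averaged with the estimated harmonic magnitudes as weights and divided by the sum of the weights. The function name, the handling of empty bands and the example threshold are hypothetical.

```python
def band_vuv_bits(freqs_hz, errors, magnitudes, num_bands, bandwidth_hz, threshold):
    """Return one V/UV decision per band (True = voiced).

    freqs_hz, errors, magnitudes are per-harmonic lists of equal length.
    """
    band_width = bandwidth_hz / num_bands
    weighted = [0.0] * num_bands  # sum of magnitude * normalized error
    weights = [0.0] * num_bands   # sum of magnitudes
    for f, err, mag in zip(freqs_hz, errors, magnitudes):
        k = min(int(f / band_width), num_bands - 1)  # equal-width bands
        weighted[k] += mag * err
        weights[k] += mag
    bits = []
    for num, den in zip(weighted, weights):
        # Assumed convention: a band containing no harmonics is unvoiced.
        bits.append(den > 0 and num / den < threshold)
    return bits

# Hypothetical frame: 200 Hz fundamental, good harmonic fit below 2 kHz only.
freqs = [200.0 * m for m in range(1, 20)]
errs = [0.1 if f < 2000 else 0.9 for f in freqs]
mags = [1.0] * len(freqs)
print(band_vuv_bits(freqs, errs, mags, num_bands=4, bandwidth_hz=4000.0, threshold=0.5))
```

Weighting by magnitude means that a strong, well-fitted harmonic can outvote a weak, poorly fitted one within the same band, which is the apparent intent of the weighting described in the text.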

5.3  Coding  -  Summary 

The  methods  used  for  coding  the  MBE  model  parameters  are  summarized 
in  Figures  5.10  through  5.13.  The  fundamental  frequency  is  coded  using 
uniform  quantization  (Figure  5.10). 


Fundamental Frequency -> Uniform Quantization -> Coded Fundamental Frequency

Figure 5.10: Fundamental Frequency Coding


The estimated phases are coded by predicting the phases of the current
frame from the coded phases in the previous frame using the coded
fundamental frequency (Figure 5.11). The differences between the predicted
phases and the estimated phases are then coded using Lloyd-Max
quantization. Only the phases of the M lowest frequency harmonics declared
voiced are coded since these appear to be more important perceptually.
The phases of harmonics declared unvoiced are not coded since they are
not required by the synthesis algorithm and the bits allocated to them are
used to code the magnitude samples.

Figure 5.11: Coding of Phases
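The prediction step can be illustrated with a short sketch. The prediction rule used here, a linear phase advance over one frame at the average of the previous and current fundamental frequencies, is an assumption for illustration; the text above states only that the prediction uses the previous coded phases and the coded fundamental frequency. All function and parameter names are hypothetical.

```python
import math

def wrap(phi):
    """Wrap an angle in radians into the interval [-pi, pi)."""
    return (phi + math.pi) % (2 * math.pi) - math.pi

def predict_phase(prev_phase, harmonic, w0_prev, w0_cur, hop):
    """Assumed predictor: advance the previous phase of the given harmonic
    by `hop` samples at the mean fundamental (radians per sample)."""
    return prev_phase + harmonic * 0.5 * (w0_prev + w0_cur) * hop

def phase_residual(estimated, predicted):
    """Phase difference that would be passed to the quantizer."""
    return wrap(estimated - predicted)

# Hypothetical example: third harmonic, steady fundamental, 160-sample frame shift.
predicted = predict_phase(prev_phase=1.0, harmonic=3, w0_prev=0.05, w0_cur=0.05, hop=160)
estimated = wrap(predicted + 0.02)  # estimate lies 0.02 rad from the prediction
print(phase_residual(estimated, predicted))
```

When the harmonic is stable from frame to frame the residual stays close to zero, which is the concentration that the histograms in Figures 5.7 through 5.9 are said to show and that the Lloyd-Max quantizer exploits.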

The magnitude samples are coded by coding the lowest frequency magnitude
sample using uniform quantization. The remaining magnitudes for
the current frame are coded using adaptive differential PCM across
frequency (Figure 5.12). The number of bits assigned to coding each
magnitude sample is determined from the coded fundamental frequency by
summing a bit distribution curve as described in Section 5.2.1.
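The idea of differential coding across frequency can be sketched in a simplified, non-adaptive form. The sketch below uses one fixed step size for every sample and places no limit on the number of code values, whereas the coder described above adapts the quantizer and assigns bits per sample from a bit distribution curve. The function names, the step size and the example magnitudes (taken to be in dB) are assumptions.

```python
def dpcm_encode(mags, step):
    """Code each magnitude as a quantized difference from the previously
    reconstructed magnitude. With the predictor starting at zero, the first
    (lowest frequency) sample is in effect uniformly quantized directly."""
    codes, recon = [], []
    prev = 0.0
    for m in mags:
        q = round((m - prev) / step)  # integer code for the difference
        prev = prev + q * step        # value the decoder will reconstruct
        codes.append(q)
        recon.append(prev)
    return codes, recon

def dpcm_decode(codes, step):
    """Rebuild the magnitudes by accumulating the quantized differences."""
    out, prev = [], 0.0
    for q in codes:
        prev = prev + q * step
        out.append(prev)
    return out

# Hypothetical harmonic magnitudes, lowest frequency first.
mags = [40.0, 43.2, 41.7, 35.1, 30.4]
codes, recon = dpcm_encode(mags, step=1.5)
print(codes, dpcm_decode(codes, 1.5))
```

Because the encoder predicts from its own reconstructed values rather than from the original magnitudes, quantization errors do not accumulate across frequency: each reconstructed sample is within half a step of the original.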

Figure 5.12: Coding of Magnitudes

The V/UV information is coded by dividing the original spectrum into
N frequency bands (N = 12 for the 8 kbps system). The error (closeness of
fit) is determined between each frequency band of the original spectrum and
the corresponding frequency band of the synthesized all-voiced spectrum
(Figure 5.13). A threshold is then used to set a V/UV bit for each
frequency band. When the error for a frequency band is below the threshold,
the all-voiced synthetic spectrum is a good match for the original spectrum
and this frequency band is declared voiced. When the error for a frequency
band is above the threshold, the all-voiced synthetic spectrum is a poor
match for the original spectrum and this frequency band is declared
unvoiced.

Figure 5.13: Coding of V/UV Information

The 8 kbps MBE Coder was implemented on a MASSCOMP computer
(68020 CPU) in the C programming language. The entire system (analysis,
coding, synthesis) requires approximately 1 minute of processing time
per second of input speech on this general purpose computer system. The
increased throughput available from special purpose architectures and
conversion from floating point to fixed point should make these algorithms
implementable in real-time with several Digital Signal Processing (DSP)
chips.


5.4  Quality  -  Informal  Listening 

Informal listening was used to compare a number of speech sentences
processed by the Multi-Band Excitation Speech Coder and the Single Band
Excitation Speech Coder. For clean speech, the speech sentences coded
by the MBE Speech Coder did not have the slight "buzziness" present in
some regions of speech processed by the SBE Speech Coder. Figure 5.14
shows a spectrogram of the sentence "He has the bluest eyes" spoken by a

Figure 5.14: Uncoded Clean Speech Spectrogram ("He has the bluest eyes")

the energy versus time (0 - 2 seconds, horizontal axis) and frequency (0 -
5 kHz, vertical axis). Periodic energy is typified by the presence of
parallel horizontal bars of darkness which occur at the harmonics of the
fundamental frequency. One region of particular interest is the /h/ phoneme
in the word "has". In this region, several harmonics of the fundamental
frequency appear in the low frequency region while the upper frequency
region
is dominated by aperiodic energy. The Multi-Band Excitation Vocoder
operating at 8 kbps reproduces this region quite faithfully using 12 V/UV
bits (Figure 5.15). The SBE Vocoder declares the entire spectrum voiced and
replaces the aperiodic energy apparent in the original spectrogram with
harmonics of the fundamental frequency (Figure 5.16). This causes a "buzzy"
sound in the speech synthesized by the SBE Vocoder which is eliminated by
the MBE Vocoder. The MBE Vocoder produces fairly high quality speech
at 8 kbps. The major degradation in these two systems (other than the
"buzziness" in the SBE Vocoder) is a slightly reverberant quality due to
the large synthesis windows (40 ms triangular windows) and the lack of
enough coded phase information.

Figure 5.15: MBE Vocoder - Clean Speech Spectrogram ("He has the bluest eyes")


For speech corrupted by additive random noise (Figure 5.17), the SBE
Coding System (Figure 5.19) had severe "buzziness" and a number of
voiced/unvoiced errors. The severe "buzziness" is due to replacing the
aperiodic energy evident in the original spectrogram by harmonics of the
fundamental frequency. The V/UV errors occur due to dominance of the
aperiodic energy in all but a few small regions of the spectrum. The
voiced/unvoiced threshold could not be raised further without a large
number of the totally unvoiced frames being declared voiced. The noisy
speech sentences processed by the Multi-Band Excitation Speech Coder (for
example, see Figure 5.18) did not have the severe "buzziness" present in
the Single Band Excitation Speech Coder and did not seem to have a problem
with voiced/unvoiced errors, since much smaller frequency regions are
covered by each V/UV decision. In addition, the sentences processed by the
MBE Vocoder sound very close to the original noisy speech.

Figure 5.17: Uncoded Noisy Speech Spectrogram

Figure 5.18: MBE Vocoder - Noisy Speech Spectrogram
5.5 Intelligibility - Diagnostic Rhyme Tests

The Diagnostic Rhyme Test (DRT) was developed to provide a measure
of the intelligibility of speech signals. The DRT is a refinement of earlier
intelligibility tests such as the Rhyme Test developed by Fairbanks [4] and
the Modified Rhyme Test developed by House et al. [12]. The form of the
DRT used here is described in detail in Voiers [32].

The DRT consists of listening to a sequence of words spoken by the
same speaker. Each of the words spoken is one of a set of two rhyming
monosyllabic words. The listener must then choose which of the two words
was spoken for each word in the sequence. The DRT word pairs were
chosen so that only the initial consonant differs in order to minimize the
effects of context. One DRT consists of 102 test words in addition to some
filler words spoken by a single speaker and corresponds to approximately
7 minutes of speech. The DRT score is adjusted to remove the effects of
guessing so that random guessing would achieve a score of zero on average.
No errors in a DRT corresponds to a score of 100.
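One scoring rule with the two properties just stated can be sketched as follows. The formula, 100 times (right minus wrong) over the total, is a common guessing correction for two-alternative tests and is assumed here; the text above does not give the exact expression, which is described in [32]. The function name is hypothetical.

```python
def drt_score(num_correct, num_wrong):
    """Guessing-adjusted score for a two-choice test.

    Random guessing gives equal expected right and wrong counts, hence an
    expected score of 0; a test with no errors scores 100.
    """
    total = num_correct + num_wrong
    return 100.0 * (num_correct - num_wrong) / total

print(drt_score(102, 0), drt_score(51, 51))
```

Under this rule each wrong answer costs twice as much as in a plain percent-correct score, since it both removes a correct answer and adds an incorrect one.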

The DRT was employed to compare uncoded speech with the 8 kbps
Multi-Band Excitation Vocoder (12 V/UV bits per frame) and the Single
Band Excitation Vocoder (1 V/UV bit per frame). Two conditions were
tested: 1) clean speech, and 2) speech corrupted by additive white Gaussian
noise. Based on the informal listening in the previous section, we expect
the scores for the two vocoders to be very close for clean speech since only
a slight quality improvement was noted for this case. For noisy speech,
the MBE Vocoder provides a significant quality improvement over the SBE
Vocoder, which leads us to expect a measurable intelligibility improvement.
The noise level was adjusted to produce approximately a 5 dB peak signal
to noise ratio in the noisy speech. However, since amplitudes of the words
on the DRT tapes differed significantly from each other, the SNR varied
substantially from word to word. In these tests, we are interested in the
relative performance of the vocoders in the same background noise, so
the exact noise level is not critical.

The DRT scores presented for clean speech (Table 5.4 and Figure 5.20)
and noisy speech (Table 5.5 and Figure 5.21) were generated from three
male speakers and 10 listeners. Figures 5.20 and 5.21 are bar graphs that
show the average DRT scores and one standard deviation above and below
them. Each of the 18 DRT tests taken by each listener was generated from
an original set of 3 DRT tests (one for each speaker) by randomly
rearranging the word pair order for each test to prevent memorization by
listeners. The listeners were inexperienced initially and were given 4-6
practice DRT tests until they became comfortable with the tests and
produced reliable scores. The scores presented in the tables were computed
by eliminating outliers in the original listeners' scores and then
computing the mean and an estimate of the standard deviation of this mean
assuming a Gaussian density for the listener scores. Outliers were
eliminated by computing the average of the scores and removing the two
scores furthest from the average. The remaining 8 scores were then used to
estimate the mean and standard deviation. Since the relative DRT scores are
of primary interest, Tables 5.6 and 5.7 show the mean and standard
deviation of the difference between the listeners' DRT scores for uncoded
speech and speech processed by the two vocoders.

                              Speaker
    System        Type   CH     JE     RH     Average
    Uncoded       Mean   97.6   95.7   97.3   96.9
                  S.D.   .36    .55    .28    ?
    8 kbps MBE    Mean   93.5   ?      95.8   93.6
                  S.D.   .90    ?      .69    .53
    Conventional  Mean   93.4   ?      95.1   93.4
                  S.D.   .84    ?      .51    .49

Table 5.4: DRT Scores - Clean Speech
(Entries marked ? are illegible in the source. Column positions in rows
with illegible entries are inferred, and the positions of the three legible
Uncoded S.D. values are uncertain.)

Table 5.5: DRT Scores - Noisy Speech
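The outlier elimination and averaging described above can be sketched as follows. Two details are assumptions, since the text does not specify them: both outliers are chosen relative to the single average of all scores (rather than one at a time), and the standard deviation of the mean is estimated as the sample standard deviation divided by the square root of the number of remaining scores. The function name and the example scores are hypothetical.

```python
import math
import statistics

def trimmed_mean_and_sd_of_mean(scores, drop=2):
    """Drop the `drop` scores furthest from the average of all scores, then
    return (mean, standard deviation of the mean, kept scores)."""
    avg = statistics.fmean(scores)
    # Sort by distance from the overall average and keep the closest ones.
    kept = sorted(scores, key=lambda s: abs(s - avg))[: len(scores) - drop]
    mean = statistics.fmean(kept)
    sd_of_mean = statistics.stdev(kept) / math.sqrt(len(kept))
    return mean, sd_of_mean, kept

# Hypothetical scores from 10 listeners, two of them far from the rest.
scores = [90, 91, 92, 93, 94, 95, 96, 97, 60, 130]
mean, sd_of_mean, kept = trimmed_mean_and_sd_of_mean(scores)
print(mean, sd_of_mean, sorted(kept))
```

With 10 listeners and two scores removed, 8 scores remain, matching the count given in the text.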

For clean speech, as expected, several points are lost going from uncoded
to coded due to lowpass filtering inherent in the vocoders and degradations
introduced by coding. Also, the intelligibility scores are approximately the
same for the MBE Vocoder and the SBE Vocoder.

                                      Speaker
    Systems               Type   CH     JE     RH     Average
    Uncoded - 8 kbps MBE  Mean   3.3    ?      ?      3.0
                          S.D.   .95    .58    .50    ?
    Uncoded - SBE         Mean   ?      ?      2.2    3.2
                          S.D.   .72    .65    .46    ?
    8 kbps MBE - SBE      Mean   ?      -.26   .78    .2
                          S.D.   .64    .44    .51    .31

Table 5.6: DRT Score Differences - Clean Speech
(Entries marked ? are illegible in the source. Where an S.D. row has only
three legible values, their column positions are uncertain.)

For  noisy  speech,  the  MBE  Vocoder  performs  an  average  of  about  6 
points  better  than  the  SBE  Vocoder  while  performing  only  about  2.6  points 
worse  than  the  uncoded  noisy  speech.  This  demonstrates  the  utility  of  the 
extra  voiced/unvoiced  bands  in  the  Multi-Band  Excitation  Vocoder. 
Table 5.7: DRT Score Differences - Noisy Speech

5.6  DRT  Scores  -  RADC 


DRT test tapes for each of the conditions tested in the previous section were
submitted to RADC for independent evaluation. The DRTs performed by
RADC employed experienced listeners in a fairly controlled environment.
The resulting DRT scores are presented for clean speech in Table 5.8 and
Figure 5.22, and for noisy speech in Table 5.9 and Figure 5.23.
Figures 5.22 and 5.23 are bar graphs that show the average
DRT scores and one standard deviation above and below them.

Table 5.8: RADC DRT Scores - Clean Speech

The RADC DRT scores confirm the trends noted in the previous section.
For clean speech, the RADC DRT scores are slightly higher than
those presented in the previous section, presumably due to experienced
listeners. Somewhat fewer DRT points are lost going from uncoded speech
to coded speech than in the previous section. As in the previous section,
the intelligibility scores for clean speech are approximately the same for
the MBE Vocoder and the SBE Vocoder.

For noisy speech, the RADC DRT scores are significantly higher than
those presented in the previous section, probably due to experienced
listeners, although the same trends are preserved. The MBE Vocoder performs
an average of about 12 points better than the SBE Vocoder while
performing only about 5 points worse than the uncoded noisy speech. This
confirms the utility of the extra voiced/unvoiced bands in the Multi-Band
Excitation Vocoder.

Table 5.9: RADC DRT Scores - Noisy Speech
Chapter 6

Directions for Future Research

6.1 Introduction

In this thesis, we have considered in detail only the application of the
Multi-Band Excitation Model to high quality speech coding. Some additional
potential applications are discussed in Section 6.2. Improvements to the
Multi-Band Excitation Speech Coding System can be made in a number of
areas. Two areas of major importance are further improvement in quality
and additional bit-rate reduction. Section 6.3 proposes some techniques for
achieving these goals.
6.2 Potential Applications

Since the Multi-Band Excitation Model separately estimates spectral
envelope and excitation parameters, it can be applied to problems requiring
modifications of these parameters. For example, in the application of
enhancement of speech spoken in a helium-oxygen mixture, a non-linear
frequency warping of the spectral envelope is desired without modifying the
excitation parameters [28].

Other applications include time-scale modification (modification of the
apparent speaking rate without changing other characteristics) and pitch
modification. Since the Multi-Band Excitation Model appears to provide an
intelligibility improvement over a system employing a single voiced/unvoiced
decision for the entire spectrum, this model may prove useful for the front
ends of speech recognition systems.
6.3 Improvement of the Speech Coding System

The quality of the Multi-Band Excitation Vocoder could be improved by
elimination of the slightly reverberant quality of the 8 kbps vocoded speech.
This degradation is due to the long synthesis windows (40 ms) used to
accomplish the 50 Hz frame rate and the lack of enough coded phase
information.

One approach to improving the quality and/or lowering the bit-rate
would be to predict much of the phase information from the magnitude
information. Since speech is often close to a minimum phase system
excited by a periodic signal, a certain amount of phase information should
be predictable from samples of the magnitude at the harmonics of the
fundamental frequency. Since noise energy often dominates the signal in some
frequency regions, this problem needs to be formulated as a best fit
problem. For example, find the minimum phase signal which provides the best
fit to the coded magnitude and several of the coded phases. A solution to
this problem would allow the remaining phases at the receiver to be
predicted from the coded phases and the coded magnitudes. If necessary, the
difference between the predicted phases and the actual phases could also
be coded at the transmitter.

A second approach to improving the quality and/or lowering the bit-rate
would be to take advantage of frame to frame correlation of the magnitude
information. Speech usually consists of regions of slowly time-varying
spectral magnitude bounded by short regions which change much more rapidly.
One method for taking advantage of this would group frames into blocks
and allocate more bits to rapidly varying sections of the block and fewer
bits to more slowly varying sections. The blocks could be made fairly short
(100-200 ms) to avoid excessive coding and decoding delay.